2023-06-23 17:26:52,793 INFO [train.py:1064] (3/4) Training started
2023-06-23 17:26:52,793 INFO [train.py:1074] (3/4) Device: cuda:3
2023-06-23 17:26:55,763 INFO [lexicon.py:168] (3/4) Loading pre-compiled data/lang_char/Linv.pt
2023-06-23 17:26:56,354 INFO [train.py:1085] (3/4) {'best_train_loss': inf, 'best_valid_loss': inf, 'best_train_epoch': -1, 'best_valid_epoch': -1, 'batch_idx_train': 0, 'log_interval': 50, 'reset_interval': 200, 'valid_interval': 3000, 'feature_dim': 80, 'subsampling_factor': 4, 'warm_step': 2000, 'env_info': {'k2-version': '1.24.1', 'k2-build-type': 'Release', 'k2-with-cuda': True, 'k2-git-sha1': 'c51a0b9684442a88ee37f3ce0af686a04b66855b', 'k2-git-date': 'Mon May 1 21:38:03 2023', 'lhotse-version': '1.14.0.dev+git.0f812851.dirty', 'torch-version': '1.10.0+cu102', 'torch-cuda-available': True, 'torch-cuda-version': '10.2', 'python-version': '3.8', 'icefall-git-branch': 'zipformer_wenetspeech', 'icefall-git-sha1': '63e53ba-dirty', 'icefall-git-date': 'Wed Jun 21 18:13:24 2023', 'icefall-path': '/star-kw/kangwei/code/icefall_wenetspeech', 'k2-path': '/ceph-hw/kangwei/code/k2_release/k2/k2/python/k2/__init__.py', 'lhotse-path': '/ceph-hw/kangwei/dev_tools/anaconda3/envs/rnnt2/lib/python3.8/site-packages/lhotse-1.14.0.dev0+git.0f812851.dirty-py3.8.egg/lhotse/__init__.py', 'hostname': 'de-74279-k2-train-6-0423201309-7c68fd68fb-6cszs', 'IP address': '10.177.28.83'}, 'world_size': 4, 'master_port': 12536, 'tensorboard': True, 'num_epochs': 12, 'start_epoch': 6, 'start_batch': 0, 'exp_dir': PosixPath('zipformer/exp_L_small'), 'lang_dir': PosixPath('data/lang_char'), 'base_lr': 0.045, 'lr_batches': 7500, 'lr_epochs': 1.5, 'ref_duration': 600, 'context_size': 2, 'prune_range': 5, 'lm_scale': 0.25, 'am_scale': 0.0, 'simple_loss_scale': 0.5, 'seed': 42, 'print_diagnostics': False, 'inf_check': False, 'save_every_n': 4000, 'keep_last_k': 30, 'average_period': 200, 'use_fp16': True, 'num_encoder_layers': '2,2,2,2,2,2', 'downsampling_factor': '1,2,4,8,4,2', 'feedforward_dim': '512,768,768,768,768,768', 'num_heads': '4,4,4,8,4,4', 'encoder_dim': '192,256,256,256,256,256', 'query_head_dim': '32', 'value_head_dim': '12', 'pos_head_dim': '4', 'pos_dim': 48, 'encoder_unmasked_dim': '192,192,192,192,192,192', 'cnn_module_kernel': '31,31,15,15,15,31', 'decoder_dim': 512, 'joiner_dim': 512, 'causal': False, 'chunk_size': '16,32,64,-1', 'left_context_frames': '64,128,256,-1', 'manifest_dir': PosixPath('data/fbank'), 'max_duration': 900, 'bucketing_sampler': True, 'num_buckets': 30, 'concatenate_cuts': False, 'duration_factor': 1.0, 'gap': 1.0, 'on_the_fly_feats': False, 'shuffle': True, 'return_cuts': True, 'num_workers': 8, 'enable_spec_aug': True, 'spec_aug_time_warp_factor': 80, 'enable_musan': True, 'training_subset': 'L', 'blank_id': 0, 'vocab_size': 5537}
2023-06-23 17:26:56,355 INFO [train.py:1087] (3/4) About to create model
2023-06-23 17:26:57,105 INFO [train.py:1091] (3/4) Number of model parameters: 32327030
2023-06-23 17:26:57,112 INFO [checkpoint.py:112] (3/4) Loading checkpoint from zipformer/exp_L_small/epoch-5.pt
2023-06-23 17:27:09,270 INFO [train.py:1106] (3/4) Using DDP
2023-06-23 17:27:09,622 INFO [train.py:1118] (3/4) Loading optimizer state dict
2023-06-23 17:27:10,137 INFO [train.py:1126] (3/4) Loading scheduler state dict
2023-06-23 17:27:10,137 INFO [asr_datamodule.py:390] (3/4) About to get train cuts
2023-06-23 17:27:10,140 INFO [asr_datamodule.py:398] (3/4) About to get dev cuts
2023-06-23 17:27:10,142 INFO [asr_datamodule.py:211] (3/4) About to get Musan cuts
2023-06-23 17:27:13,461 INFO [asr_datamodule.py:216] (3/4) Enable MUSAN
2023-06-23 17:27:13,462 INFO [asr_datamodule.py:239] (3/4) Enable SpecAugment
2023-06-23 17:27:13,462 INFO [asr_datamodule.py:240] (3/4) Time warp factor: 80
2023-06-23 17:27:13,462 INFO [asr_datamodule.py:250] (3/4) Num frame mask: 10
2023-06-23 17:27:13,462 INFO [asr_datamodule.py:263] (3/4) About to create train dataset
2023-06-23 17:27:13,463 INFO [asr_datamodule.py:289] (3/4) Using DynamicBucketingSampler.
2023-06-23 17:27:19,119 INFO [asr_datamodule.py:305] (3/4) About to create train dataloader
2023-06-23 17:27:19,120 INFO [asr_datamodule.py:336] (3/4) About to create dev dataset
2023-06-23 17:27:20,064 INFO [asr_datamodule.py:354] (3/4) About to create dev dataloader
2023-06-23 17:27:20,065 INFO [train.py:1206] (3/4) Loading grad scaler state dict
2023-06-23 17:29:33,339 INFO [train.py:996] (3/4) Epoch 6, batch 0, loss[loss=0.221, simple_loss=0.2836, pruned_loss=0.07921, over 21732.00 frames. ], tot_loss[loss=0.221, simple_loss=0.2836, pruned_loss=0.07921, over 21732.00 frames. ], batch size: 317, lr: 5.35e-03, grad_scale: 32.0
2023-06-23 17:29:33,340 INFO [train.py:1019] (3/4) Computing validation loss
2023-06-23 17:29:50,960 INFO [train.py:1028] (3/4) Epoch 6, validation: loss=0.2383, simple_loss=0.345, pruned_loss=0.06586, over 1796401.00 frames.
2023-06-23 17:29:50,960 INFO [train.py:1029] (3/4) Maximum memory allocated so far is 21523MB
2023-06-23 17:29:59,749 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=9.46 vs. limit=15.0
2023-06-23 17:30:27,707 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=914898.0, ans=0.0
2023-06-23 17:30:28,635 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.741e+02 4.794e+02 6.251e+02 8.348e+02 2.118e+03, threshold=1.250e+03, percent-clipped=42.0
2023-06-23 17:31:12,940 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=915018.0, ans=0.125
2023-06-23 17:31:12,984 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=915018.0, ans=0.1
2023-06-23 17:31:35,964 INFO [train.py:996] (3/4) Epoch 6, batch 50, loss[loss=0.3025, simple_loss=0.371, pruned_loss=0.117, over 21482.00 frames. ], tot_loss[loss=0.2493, simple_loss=0.3256, pruned_loss=0.08653, over 970624.88 frames. ], batch size: 471, lr: 5.35e-03, grad_scale: 16.0
2023-06-23 17:31:48,136 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=915138.0, ans=0.125
2023-06-23 17:32:42,373 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=915318.0, ans=0.125
2023-06-23 17:33:00,784 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=915318.0, ans=0.0
2023-06-23 17:33:21,866 INFO [train.py:996] (3/4) Epoch 6, batch 100, loss[loss=0.2545, simple_loss=0.3335, pruned_loss=0.08779, over 21510.00 frames. ], tot_loss[loss=0.2511, simple_loss=0.3329, pruned_loss=0.08461, over 1713400.33 frames. ], batch size: 194, lr: 5.34e-03, grad_scale: 16.0
2023-06-23 17:33:40,874 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.58 vs.
limit=12.0 2023-06-23 17:33:46,215 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.53 vs. limit=15.0 2023-06-23 17:33:47,217 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=915498.0, ans=0.125 2023-06-23 17:34:04,536 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.886e+02 2.333e+02 2.600e+02 2.995e+02 4.991e+02, threshold=5.199e+02, percent-clipped=0.0 2023-06-23 17:34:11,196 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=915558.0, ans=0.125 2023-06-23 17:35:09,857 INFO [train.py:996] (3/4) Epoch 6, batch 150, loss[loss=0.2601, simple_loss=0.3354, pruned_loss=0.09243, over 21791.00 frames. ], tot_loss[loss=0.2502, simple_loss=0.3332, pruned_loss=0.08364, over 2276076.35 frames. ], batch size: 124, lr: 5.34e-03, grad_scale: 16.0 2023-06-23 17:35:29,928 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=915798.0, ans=0.125 2023-06-23 17:36:43,380 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=915978.0, ans=0.95 2023-06-23 17:36:59,849 INFO [train.py:996] (3/4) Epoch 6, batch 200, loss[loss=0.265, simple_loss=0.3648, pruned_loss=0.08266, over 21209.00 frames. ], tot_loss[loss=0.2454, simple_loss=0.3281, pruned_loss=0.08132, over 2716151.43 frames. ], batch size: 548, lr: 5.34e-03, grad_scale: 16.0 2023-06-23 17:37:10,122 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.26 vs. limit=15.0 2023-06-23 17:37:40,264 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.885e+02 2.585e+02 2.985e+02 3.639e+02 6.609e+02, threshold=5.970e+02, percent-clipped=4.0 2023-06-23 17:38:47,084 INFO [train.py:996] (3/4) Epoch 6, batch 250, loss[loss=0.2215, simple_loss=0.3174, pruned_loss=0.06285, over 21666.00 frames. ], tot_loss[loss=0.2419, simple_loss=0.3237, pruned_loss=0.08004, over 3059453.49 frames. ], batch size: 414, lr: 5.34e-03, grad_scale: 16.0 2023-06-23 17:38:48,230 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.65 vs. limit=22.5 2023-06-23 17:38:51,315 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=916338.0, ans=0.0 2023-06-23 17:39:21,107 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=916398.0, ans=0.2 2023-06-23 17:39:36,777 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=916458.0, ans=0.0 2023-06-23 17:39:38,873 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.70 vs. 
limit=15.0 2023-06-23 17:39:58,584 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=916518.0, ans=0.1 2023-06-23 17:40:18,709 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=916578.0, ans=0.1 2023-06-23 17:40:28,676 INFO [train.py:996] (3/4) Epoch 6, batch 300, loss[loss=0.2616, simple_loss=0.3329, pruned_loss=0.09518, over 21475.00 frames. ], tot_loss[loss=0.2394, simple_loss=0.3185, pruned_loss=0.08012, over 3320464.06 frames. ], batch size: 194, lr: 5.34e-03, grad_scale: 16.0 2023-06-23 17:40:40,315 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=916638.0, ans=0.0 2023-06-23 17:40:57,565 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.08 vs. limit=15.0 2023-06-23 17:41:02,283 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=916698.0, ans=10.0 2023-06-23 17:41:08,939 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.075e+02 2.631e+02 3.060e+02 3.627e+02 5.054e+02, threshold=6.120e+02, percent-clipped=0.0 2023-06-23 17:41:15,582 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.58 vs. limit=22.5 2023-06-23 17:41:24,503 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.37 vs. limit=15.0 2023-06-23 17:41:50,989 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=916818.0, ans=0.0 2023-06-23 17:41:56,300 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=916818.0, ans=0.5 2023-06-23 17:42:13,690 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.10 vs. limit=22.5 2023-06-23 17:42:21,702 INFO [train.py:996] (3/4) Epoch 6, batch 350, loss[loss=0.2386, simple_loss=0.3019, pruned_loss=0.08767, over 20004.00 frames. ], tot_loss[loss=0.2348, simple_loss=0.3109, pruned_loss=0.07939, over 3532525.76 frames. ], batch size: 703, lr: 5.34e-03, grad_scale: 16.0 2023-06-23 17:42:29,378 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=916938.0, ans=0.0 2023-06-23 17:42:46,967 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.44 vs. 
limit=22.5 2023-06-23 17:43:00,600 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=916998.0, ans=0.125 2023-06-23 17:43:25,882 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.min_abs, batch_count=917058.0, ans=0.5 2023-06-23 17:43:29,596 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=917058.0, ans=0.125 2023-06-23 17:43:42,238 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=917118.0, ans=0.2 2023-06-23 17:43:52,547 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=917178.0, ans=0.125 2023-06-23 17:44:07,647 INFO [train.py:996] (3/4) Epoch 6, batch 400, loss[loss=0.2334, simple_loss=0.3333, pruned_loss=0.0667, over 21821.00 frames. ], tot_loss[loss=0.2299, simple_loss=0.3037, pruned_loss=0.0781, over 3683666.09 frames. ], batch size: 316, lr: 5.34e-03, grad_scale: 32.0 2023-06-23 17:44:18,850 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-23 17:44:47,057 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.014e+02 2.687e+02 2.996e+02 3.462e+02 5.169e+02, threshold=5.992e+02, percent-clipped=0.0 2023-06-23 17:44:49,550 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=917358.0, ans=0.1 2023-06-23 17:45:14,425 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=917358.0, ans=0.09899494936611666 2023-06-23 17:45:45,536 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.75 vs. limit=10.0 2023-06-23 17:45:55,316 INFO [train.py:996] (3/4) Epoch 6, batch 450, loss[loss=0.2067, simple_loss=0.3026, pruned_loss=0.0554, over 21782.00 frames. ], tot_loss[loss=0.2267, simple_loss=0.3, pruned_loss=0.07674, over 3818790.43 frames. ], batch size: 371, lr: 5.34e-03, grad_scale: 32.0 2023-06-23 17:46:05,086 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=917538.0, ans=0.5 2023-06-23 17:47:38,313 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-23 17:47:46,334 INFO [train.py:996] (3/4) Epoch 6, batch 500, loss[loss=0.22, simple_loss=0.2924, pruned_loss=0.07378, over 21780.00 frames. ], tot_loss[loss=0.2252, simple_loss=0.2999, pruned_loss=0.07526, over 3920754.20 frames. 
], batch size: 247, lr: 5.34e-03, grad_scale: 16.0 2023-06-23 17:48:11,913 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=917898.0, ans=0.1 2023-06-23 17:48:32,679 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.054e+02 2.519e+02 2.896e+02 3.744e+02 5.708e+02, threshold=5.793e+02, percent-clipped=0.0 2023-06-23 17:48:56,319 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=918018.0, ans=0.1 2023-06-23 17:49:15,989 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=918078.0, ans=0.125 2023-06-23 17:49:30,899 INFO [train.py:996] (3/4) Epoch 6, batch 550, loss[loss=0.1996, simple_loss=0.2777, pruned_loss=0.06074, over 21468.00 frames. ], tot_loss[loss=0.2282, simple_loss=0.3053, pruned_loss=0.07554, over 3987439.19 frames. ], batch size: 131, lr: 5.34e-03, grad_scale: 16.0 2023-06-23 17:49:37,086 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=918138.0, ans=0.1 2023-06-23 17:49:37,656 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=8.44 vs. limit=15.0 2023-06-23 17:50:24,836 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=918258.0, ans=0.125 2023-06-23 17:51:15,196 INFO [train.py:996] (3/4) Epoch 6, batch 600, loss[loss=0.1849, simple_loss=0.2958, pruned_loss=0.037, over 20809.00 frames. ], tot_loss[loss=0.2303, simple_loss=0.3086, pruned_loss=0.07604, over 4046674.24 frames. ], batch size: 608, lr: 5.34e-03, grad_scale: 16.0 2023-06-23 17:51:15,650 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=918438.0, ans=0.125 2023-06-23 17:52:12,356 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.806e+02 2.707e+02 3.073e+02 3.854e+02 5.945e+02, threshold=6.147e+02, percent-clipped=1.0 2023-06-23 17:52:29,565 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=8.24 vs. limit=15.0 2023-06-23 17:52:42,591 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=918618.0, ans=0.05 2023-06-23 17:52:52,567 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=918678.0, ans=0.125 2023-06-23 17:53:04,173 INFO [train.py:996] (3/4) Epoch 6, batch 650, loss[loss=0.2203, simple_loss=0.2921, pruned_loss=0.07427, over 21857.00 frames. ], tot_loss[loss=0.2315, simple_loss=0.3102, pruned_loss=0.07642, over 4102907.80 frames. 
], batch size: 124, lr: 5.34e-03, grad_scale: 16.0 2023-06-23 17:53:32,963 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=918798.0, ans=0.0 2023-06-23 17:54:05,714 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=918858.0, ans=0.125 2023-06-23 17:54:10,997 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=918918.0, ans=0.0 2023-06-23 17:54:46,836 INFO [train.py:996] (3/4) Epoch 6, batch 700, loss[loss=0.224, simple_loss=0.3071, pruned_loss=0.07044, over 21771.00 frames. ], tot_loss[loss=0.2329, simple_loss=0.3103, pruned_loss=0.07773, over 4145983.20 frames. ], batch size: 112, lr: 5.33e-03, grad_scale: 16.0 2023-06-23 17:55:18,525 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.04 vs. limit=10.0 2023-06-23 17:55:26,072 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=5.80 vs. limit=15.0 2023-06-23 17:55:38,548 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.091e+02 2.507e+02 2.938e+02 3.548e+02 4.696e+02, threshold=5.875e+02, percent-clipped=0.0 2023-06-23 17:55:58,460 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=919158.0, ans=0.125 2023-06-23 17:56:35,819 INFO [train.py:996] (3/4) Epoch 6, batch 750, loss[loss=0.1924, simple_loss=0.2601, pruned_loss=0.06241, over 21360.00 frames. ], tot_loss[loss=0.235, simple_loss=0.3118, pruned_loss=0.07911, over 4181584.95 frames. ], batch size: 194, lr: 5.33e-03, grad_scale: 16.0 2023-06-23 17:58:01,436 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=919518.0, ans=0.125 2023-06-23 17:58:16,551 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=919578.0, ans=0.2 2023-06-23 17:58:24,787 INFO [train.py:996] (3/4) Epoch 6, batch 800, loss[loss=0.2251, simple_loss=0.2942, pruned_loss=0.07796, over 21947.00 frames. ], tot_loss[loss=0.2336, simple_loss=0.3096, pruned_loss=0.07885, over 4201527.98 frames. ], batch size: 333, lr: 5.33e-03, grad_scale: 32.0 2023-06-23 17:58:32,285 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=919638.0, ans=0.0 2023-06-23 17:58:37,623 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=919638.0, ans=0.125 2023-06-23 17:59:04,526 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.034e+02 2.561e+02 2.955e+02 3.550e+02 6.098e+02, threshold=5.911e+02, percent-clipped=2.0 2023-06-23 17:59:05,087 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=919758.0, ans=0.1 2023-06-23 17:59:32,273 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=10.36 vs. 
limit=15.0 2023-06-23 17:59:36,692 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=919818.0, ans=0.0 2023-06-23 18:00:08,608 INFO [train.py:996] (3/4) Epoch 6, batch 850, loss[loss=0.2249, simple_loss=0.2977, pruned_loss=0.07604, over 21240.00 frames. ], tot_loss[loss=0.2316, simple_loss=0.3062, pruned_loss=0.07847, over 4226226.13 frames. ], batch size: 144, lr: 5.33e-03, grad_scale: 32.0 2023-06-23 18:00:16,512 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.min_abs, batch_count=919938.0, ans=0.5 2023-06-23 18:00:36,733 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=919998.0, ans=0.125 2023-06-23 18:00:57,158 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.97 vs. limit=22.5 2023-06-23 18:01:14,160 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=920058.0, ans=0.125 2023-06-23 18:01:24,648 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=920118.0, ans=0.2 2023-06-23 18:01:34,705 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=3.47 vs. limit=12.0 2023-06-23 18:01:59,419 INFO [train.py:996] (3/4) Epoch 6, batch 900, loss[loss=0.1953, simple_loss=0.2673, pruned_loss=0.06169, over 21760.00 frames. ], tot_loss[loss=0.2286, simple_loss=0.3032, pruned_loss=0.07702, over 4230153.65 frames. ], batch size: 124, lr: 5.33e-03, grad_scale: 16.0 2023-06-23 18:02:25,484 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=920298.0, ans=0.0 2023-06-23 18:02:27,354 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=920298.0, ans=0.125 2023-06-23 18:02:49,233 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.57 vs. limit=15.0 2023-06-23 18:02:53,506 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.973e+02 2.549e+02 3.030e+02 3.332e+02 5.799e+02, threshold=6.061e+02, percent-clipped=0.0 2023-06-23 18:03:14,319 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=920418.0, ans=0.2 2023-06-23 18:03:31,332 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=920478.0, ans=0.125 2023-06-23 18:03:50,034 INFO [train.py:996] (3/4) Epoch 6, batch 950, loss[loss=0.2231, simple_loss=0.3108, pruned_loss=0.06767, over 21762.00 frames. ], tot_loss[loss=0.2276, simple_loss=0.3012, pruned_loss=0.07704, over 4240159.47 frames. 
], batch size: 247, lr: 5.33e-03, grad_scale: 16.0 2023-06-23 18:04:47,299 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=920658.0, ans=0.1 2023-06-23 18:05:20,509 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=920718.0, ans=0.125 2023-06-23 18:05:35,217 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.69 vs. limit=10.0 2023-06-23 18:05:38,214 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=920778.0, ans=0.2 2023-06-23 18:05:41,239 INFO [train.py:996] (3/4) Epoch 6, batch 1000, loss[loss=0.2308, simple_loss=0.323, pruned_loss=0.06933, over 21717.00 frames. ], tot_loss[loss=0.2274, simple_loss=0.3007, pruned_loss=0.07707, over 4257911.67 frames. ], batch size: 414, lr: 5.33e-03, grad_scale: 16.0 2023-06-23 18:05:57,267 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=920838.0, ans=0.1 2023-06-23 18:06:06,270 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=920838.0, ans=0.1 2023-06-23 18:06:37,292 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=920898.0, ans=0.0 2023-06-23 18:06:42,101 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.855e+02 2.583e+02 2.913e+02 3.407e+02 5.854e+02, threshold=5.827e+02, percent-clipped=0.0 2023-06-23 18:07:29,577 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=921078.0, ans=0.0 2023-06-23 18:07:32,549 INFO [train.py:996] (3/4) Epoch 6, batch 1050, loss[loss=0.3164, simple_loss=0.365, pruned_loss=0.1339, over 21456.00 frames. ], tot_loss[loss=0.229, simple_loss=0.3018, pruned_loss=0.07809, over 4260804.40 frames. ], batch size: 507, lr: 5.33e-03, grad_scale: 16.0 2023-06-23 18:08:33,564 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-23 18:08:35,778 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.63 vs. limit=15.0 2023-06-23 18:08:53,335 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=921318.0, ans=0.0 2023-06-23 18:09:31,296 INFO [train.py:996] (3/4) Epoch 6, batch 1100, loss[loss=0.2795, simple_loss=0.3461, pruned_loss=0.1064, over 21577.00 frames. ], tot_loss[loss=0.2285, simple_loss=0.3023, pruned_loss=0.07735, over 4267571.42 frames. ], batch size: 414, lr: 5.33e-03, grad_scale: 16.0 2023-06-23 18:09:52,517 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=921438.0, ans=0.1 2023-06-23 18:10:02,175 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=7.06 vs. limit=15.0 2023-06-23 18:10:18,039 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=8.95 vs. 
limit=15.0 2023-06-23 18:10:25,286 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.974e+02 2.670e+02 3.079e+02 4.028e+02 7.418e+02, threshold=6.158e+02, percent-clipped=6.0 2023-06-23 18:10:39,339 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.29 vs. limit=12.0 2023-06-23 18:11:19,110 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=921678.0, ans=0.035 2023-06-23 18:11:29,953 INFO [train.py:996] (3/4) Epoch 6, batch 1150, loss[loss=0.2826, simple_loss=0.3393, pruned_loss=0.1129, over 21532.00 frames. ], tot_loss[loss=0.2281, simple_loss=0.3029, pruned_loss=0.07667, over 4274775.76 frames. ], batch size: 471, lr: 5.33e-03, grad_scale: 16.0 2023-06-23 18:12:34,781 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-23 18:13:06,330 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.31 vs. limit=22.5 2023-06-23 18:13:17,164 INFO [train.py:996] (3/4) Epoch 6, batch 1200, loss[loss=0.1965, simple_loss=0.2384, pruned_loss=0.07734, over 19970.00 frames. ], tot_loss[loss=0.2268, simple_loss=0.302, pruned_loss=0.0758, over 4277739.37 frames. ], batch size: 703, lr: 5.33e-03, grad_scale: 32.0 2023-06-23 18:13:59,211 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.082e+02 2.616e+02 3.018e+02 3.638e+02 5.698e+02, threshold=6.035e+02, percent-clipped=0.0 2023-06-23 18:14:10,083 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=922158.0, ans=0.125 2023-06-23 18:14:21,582 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.06 vs. limit=6.0 2023-06-23 18:14:36,014 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.65 vs. limit=15.0 2023-06-23 18:14:49,772 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=922278.0, ans=0.125 2023-06-23 18:15:02,942 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=922278.0, ans=0.125 2023-06-23 18:15:07,508 INFO [train.py:996] (3/4) Epoch 6, batch 1250, loss[loss=0.2538, simple_loss=0.319, pruned_loss=0.09434, over 21273.00 frames. ], tot_loss[loss=0.2292, simple_loss=0.3035, pruned_loss=0.07742, over 4276826.97 frames. ], batch size: 159, lr: 5.32e-03, grad_scale: 32.0 2023-06-23 18:15:09,718 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=922338.0, ans=0.0 2023-06-23 18:15:31,462 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=922398.0, ans=0.125 2023-06-23 18:15:43,012 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=9.23 vs. 
limit=15.0 2023-06-23 18:15:52,005 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=922458.0, ans=0.125 2023-06-23 18:16:04,025 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=922458.0, ans=0.125 2023-06-23 18:16:27,609 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=922518.0, ans=0.125 2023-06-23 18:16:59,815 INFO [train.py:996] (3/4) Epoch 6, batch 1300, loss[loss=0.2563, simple_loss=0.3426, pruned_loss=0.08498, over 21771.00 frames. ], tot_loss[loss=0.2308, simple_loss=0.3055, pruned_loss=0.07809, over 4277694.80 frames. ], batch size: 351, lr: 5.32e-03, grad_scale: 32.0 2023-06-23 18:17:02,443 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=922638.0, ans=0.125 2023-06-23 18:17:18,094 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=922638.0, ans=0.0 2023-06-23 18:17:22,058 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.min_positive, batch_count=922698.0, ans=0.05 2023-06-23 18:17:31,265 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=922698.0, ans=0.0 2023-06-23 18:17:42,855 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.046e+02 2.762e+02 3.245e+02 4.001e+02 7.520e+02, threshold=6.490e+02, percent-clipped=2.0 2023-06-23 18:17:56,439 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=922758.0, ans=0.0 2023-06-23 18:18:07,980 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=922818.0, ans=0.125 2023-06-23 18:18:11,616 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=922818.0, ans=0.125 2023-06-23 18:18:34,681 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=922878.0, ans=0.2 2023-06-23 18:18:38,268 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=922878.0, ans=0.2 2023-06-23 18:18:46,508 INFO [train.py:996] (3/4) Epoch 6, batch 1350, loss[loss=0.2801, simple_loss=0.3582, pruned_loss=0.101, over 21473.00 frames. ], tot_loss[loss=0.2312, simple_loss=0.3058, pruned_loss=0.07827, over 4286700.39 frames. 
], batch size: 471, lr: 5.32e-03, grad_scale: 32.0 2023-06-23 18:19:17,224 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=922998.0, ans=0.04949747468305833 2023-06-23 18:19:51,312 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=923118.0, ans=0.125 2023-06-23 18:20:15,033 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=923118.0, ans=0.0 2023-06-23 18:20:16,515 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=923118.0, ans=0.125 2023-06-23 18:20:32,025 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-23 18:20:36,686 INFO [train.py:996] (3/4) Epoch 6, batch 1400, loss[loss=0.2549, simple_loss=0.34, pruned_loss=0.08492, over 21653.00 frames. ], tot_loss[loss=0.2292, simple_loss=0.304, pruned_loss=0.07715, over 4284493.76 frames. ], batch size: 263, lr: 5.32e-03, grad_scale: 32.0 2023-06-23 18:21:18,088 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=923358.0, ans=0.125 2023-06-23 18:21:21,166 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.008e+02 2.458e+02 2.680e+02 3.185e+02 5.161e+02, threshold=5.361e+02, percent-clipped=0.0 2023-06-23 18:21:21,740 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=923358.0, ans=0.0 2023-06-23 18:21:49,917 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=923418.0, ans=0.1 2023-06-23 18:22:35,813 INFO [train.py:996] (3/4) Epoch 6, batch 1450, loss[loss=0.2249, simple_loss=0.3066, pruned_loss=0.07163, over 21702.00 frames. ], tot_loss[loss=0.2306, simple_loss=0.3049, pruned_loss=0.07818, over 4278880.67 frames. ], batch size: 263, lr: 5.32e-03, grad_scale: 16.0 2023-06-23 18:22:41,749 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=923538.0, ans=0.1 2023-06-23 18:23:04,778 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=923598.0, ans=0.0 2023-06-23 18:23:13,647 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=923658.0, ans=0.0 2023-06-23 18:23:30,840 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=923658.0, ans=0.0 2023-06-23 18:24:07,897 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=923778.0, ans=0.125 2023-06-23 18:24:15,611 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=5.58 vs. limit=15.0 2023-06-23 18:24:26,500 INFO [train.py:996] (3/4) Epoch 6, batch 1500, loss[loss=0.2101, simple_loss=0.2803, pruned_loss=0.06996, over 21811.00 frames. ], tot_loss[loss=0.2328, simple_loss=0.3067, pruned_loss=0.07947, over 4285818.94 frames. 
], batch size: 282, lr: 5.32e-03, grad_scale: 16.0 2023-06-23 18:25:06,056 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.940e+02 2.614e+02 2.900e+02 3.425e+02 5.180e+02, threshold=5.801e+02, percent-clipped=0.0 2023-06-23 18:25:06,685 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=923958.0, ans=0.035 2023-06-23 18:25:09,063 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.68 vs. limit=22.5 2023-06-23 18:25:38,755 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=924018.0, ans=0.125 2023-06-23 18:25:38,832 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=924018.0, ans=0.0 2023-06-23 18:25:43,897 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=924018.0, ans=0.0 2023-06-23 18:26:20,372 INFO [train.py:996] (3/4) Epoch 6, batch 1550, loss[loss=0.1391, simple_loss=0.2005, pruned_loss=0.03885, over 17243.00 frames. ], tot_loss[loss=0.2307, simple_loss=0.3051, pruned_loss=0.07812, over 4288087.40 frames. ], batch size: 62, lr: 5.32e-03, grad_scale: 16.0 2023-06-23 18:27:09,682 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-23 18:27:22,860 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=924258.0, ans=0.125 2023-06-23 18:28:14,002 INFO [train.py:996] (3/4) Epoch 6, batch 1600, loss[loss=0.304, simple_loss=0.369, pruned_loss=0.1195, over 21433.00 frames. ], tot_loss[loss=0.2292, simple_loss=0.3041, pruned_loss=0.0772, over 4283101.87 frames. ], batch size: 507, lr: 5.32e-03, grad_scale: 32.0 2023-06-23 18:28:31,773 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=924498.0, ans=0.1 2023-06-23 18:28:34,202 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.81 vs. limit=12.0 2023-06-23 18:28:50,705 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=924498.0, ans=0.125 2023-06-23 18:29:08,555 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.100e+02 2.611e+02 2.907e+02 3.387e+02 5.572e+02, threshold=5.813e+02, percent-clipped=0.0 2023-06-23 18:30:08,420 INFO [train.py:996] (3/4) Epoch 6, batch 1650, loss[loss=0.1815, simple_loss=0.2479, pruned_loss=0.05753, over 21448.00 frames. ], tot_loss[loss=0.2285, simple_loss=0.3028, pruned_loss=0.07712, over 4279341.49 frames. 
], batch size: 230, lr: 5.32e-03, grad_scale: 16.0 2023-06-23 18:30:08,884 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=924738.0, ans=0.2 2023-06-23 18:30:33,953 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=924798.0, ans=0.125 2023-06-23 18:30:35,916 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=924798.0, ans=0.1 2023-06-23 18:30:52,052 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.65 vs. limit=15.0 2023-06-23 18:31:02,525 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=924858.0, ans=0.125 2023-06-23 18:31:37,579 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=924918.0, ans=0.125 2023-06-23 18:32:02,887 INFO [train.py:996] (3/4) Epoch 6, batch 1700, loss[loss=0.2245, simple_loss=0.2947, pruned_loss=0.07714, over 21682.00 frames. ], tot_loss[loss=0.2318, simple_loss=0.3067, pruned_loss=0.07848, over 4283075.78 frames. ], batch size: 230, lr: 5.32e-03, grad_scale: 16.0 2023-06-23 18:32:27,516 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=925098.0, ans=0.125 2023-06-23 18:32:45,725 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=925098.0, ans=0.2 2023-06-23 18:33:01,019 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.000e+02 2.590e+02 2.907e+02 3.447e+02 5.734e+02, threshold=5.814e+02, percent-clipped=0.0 2023-06-23 18:34:02,128 INFO [train.py:996] (3/4) Epoch 6, batch 1750, loss[loss=0.204, simple_loss=0.2814, pruned_loss=0.06333, over 21474.00 frames. ], tot_loss[loss=0.2273, simple_loss=0.3036, pruned_loss=0.07555, over 4265967.45 frames. ], batch size: 211, lr: 5.32e-03, grad_scale: 16.0 2023-06-23 18:34:17,023 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=925338.0, ans=0.2 2023-06-23 18:35:02,053 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=925458.0, ans=0.07 2023-06-23 18:35:07,251 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=925458.0, ans=0.125 2023-06-23 18:35:27,726 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.05 vs. limit=22.5 2023-06-23 18:35:46,777 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=925578.0, ans=0.125 2023-06-23 18:36:02,448 INFO [train.py:996] (3/4) Epoch 6, batch 1800, loss[loss=0.2019, simple_loss=0.2999, pruned_loss=0.05193, over 21746.00 frames. ], tot_loss[loss=0.2247, simple_loss=0.3021, pruned_loss=0.07361, over 4274164.05 frames. 
], batch size: 352, lr: 5.32e-03, grad_scale: 16.0 2023-06-23 18:36:53,272 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=925758.0, ans=0.125 2023-06-23 18:36:56,006 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.930e+02 2.395e+02 2.914e+02 3.634e+02 6.423e+02, threshold=5.828e+02, percent-clipped=1.0 2023-06-23 18:36:58,518 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=925758.0, ans=0.035 2023-06-23 18:37:09,103 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=925818.0, ans=0.0 2023-06-23 18:37:30,019 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=11.51 vs. limit=22.5 2023-06-23 18:37:41,106 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=925878.0, ans=0.125 2023-06-23 18:37:53,462 INFO [train.py:996] (3/4) Epoch 6, batch 1850, loss[loss=0.2118, simple_loss=0.304, pruned_loss=0.05978, over 21784.00 frames. ], tot_loss[loss=0.2259, simple_loss=0.3055, pruned_loss=0.07312, over 4277672.44 frames. ], batch size: 282, lr: 5.31e-03, grad_scale: 8.0 2023-06-23 18:38:19,283 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=925998.0, ans=0.0 2023-06-23 18:39:00,953 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=926118.0, ans=0.2 2023-06-23 18:39:46,159 INFO [train.py:996] (3/4) Epoch 6, batch 1900, loss[loss=0.1753, simple_loss=0.2609, pruned_loss=0.0448, over 21629.00 frames. ], tot_loss[loss=0.2259, simple_loss=0.3051, pruned_loss=0.07337, over 4276222.91 frames. ], batch size: 230, lr: 5.31e-03, grad_scale: 8.0 2023-06-23 18:40:28,221 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-23 18:40:29,827 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=926298.0, ans=0.125 2023-06-23 18:40:39,854 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.885e+02 2.383e+02 2.644e+02 3.253e+02 4.924e+02, threshold=5.288e+02, percent-clipped=0.0 2023-06-23 18:40:56,457 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=926418.0, ans=0.2 2023-06-23 18:40:56,482 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=926418.0, ans=0.1 2023-06-23 18:40:58,066 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-23 18:41:34,900 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=926478.0, ans=0.0 2023-06-23 18:41:37,827 INFO [train.py:996] (3/4) Epoch 6, batch 1950, loss[loss=0.2534, simple_loss=0.3274, pruned_loss=0.0897, over 21365.00 frames. ], tot_loss[loss=0.2251, simple_loss=0.3033, pruned_loss=0.07351, over 4283557.15 frames. 
], batch size: 549, lr: 5.31e-03, grad_scale: 8.0 2023-06-23 18:42:13,037 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=926598.0, ans=0.07 2023-06-23 18:42:30,405 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=6.54 vs. limit=15.0 2023-06-23 18:42:37,568 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=926658.0, ans=0.125 2023-06-23 18:43:01,086 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.23 vs. limit=15.0 2023-06-23 18:43:34,108 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=926778.0, ans=0.05 2023-06-23 18:43:37,086 INFO [train.py:996] (3/4) Epoch 6, batch 2000, loss[loss=0.1648, simple_loss=0.2414, pruned_loss=0.04413, over 21306.00 frames. ], tot_loss[loss=0.222, simple_loss=0.2996, pruned_loss=0.0722, over 4280654.33 frames. ], batch size: 131, lr: 5.31e-03, grad_scale: 16.0 2023-06-23 18:43:38,309 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=7.01 vs. limit=15.0 2023-06-23 18:44:08,496 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.82 vs. limit=22.5 2023-06-23 18:44:09,909 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=926898.0, ans=0.2 2023-06-23 18:44:18,737 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.15 vs. limit=22.5 2023-06-23 18:44:24,631 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.849e+02 2.599e+02 2.979e+02 3.641e+02 7.240e+02, threshold=5.958e+02, percent-clipped=3.0 2023-06-23 18:44:56,622 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=927018.0, ans=0.125 2023-06-23 18:45:28,375 INFO [train.py:996] (3/4) Epoch 6, batch 2050, loss[loss=0.2364, simple_loss=0.3168, pruned_loss=0.07794, over 21769.00 frames. ], tot_loss[loss=0.2221, simple_loss=0.3002, pruned_loss=0.072, over 4271723.12 frames. ], batch size: 247, lr: 5.31e-03, grad_scale: 16.0 2023-06-23 18:45:54,721 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=927198.0, ans=0.125 2023-06-23 18:45:54,783 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=927198.0, ans=0.0 2023-06-23 18:47:20,410 INFO [train.py:996] (3/4) Epoch 6, batch 2100, loss[loss=0.2372, simple_loss=0.3066, pruned_loss=0.08394, over 20724.00 frames. ], tot_loss[loss=0.2244, simple_loss=0.3013, pruned_loss=0.07378, over 4278100.06 frames. 
], batch size: 607, lr: 5.31e-03, grad_scale: 16.0 2023-06-23 18:47:28,415 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=927438.0, ans=0.125 2023-06-23 18:48:01,788 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=927558.0, ans=0.2 2023-06-23 18:48:08,463 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.892e+02 2.503e+02 2.741e+02 3.125e+02 4.918e+02, threshold=5.483e+02, percent-clipped=0.0 2023-06-23 18:48:45,681 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=927678.0, ans=0.2 2023-06-23 18:49:11,112 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=927738.0, ans=0.125 2023-06-23 18:49:12,077 INFO [train.py:996] (3/4) Epoch 6, batch 2150, loss[loss=0.2207, simple_loss=0.2814, pruned_loss=0.08, over 21493.00 frames. ], tot_loss[loss=0.226, simple_loss=0.3014, pruned_loss=0.07529, over 4278717.53 frames. ], batch size: 441, lr: 5.31e-03, grad_scale: 16.0 2023-06-23 18:49:18,323 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=927738.0, ans=0.0 2023-06-23 18:49:44,091 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=927798.0, ans=0.125 2023-06-23 18:50:21,953 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.93 vs. limit=15.0 2023-06-23 18:50:59,982 INFO [train.py:996] (3/4) Epoch 6, batch 2200, loss[loss=0.2, simple_loss=0.2721, pruned_loss=0.06396, over 21178.00 frames. ], tot_loss[loss=0.2271, simple_loss=0.3028, pruned_loss=0.07569, over 4269999.98 frames. ], batch size: 608, lr: 5.31e-03, grad_scale: 16.0 2023-06-23 18:51:47,928 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.154e+02 2.632e+02 2.959e+02 3.421e+02 5.687e+02, threshold=5.917e+02, percent-clipped=1.0 2023-06-23 18:52:07,082 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.38 vs. limit=22.5 2023-06-23 18:52:36,021 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=928278.0, ans=0.125 2023-06-23 18:52:49,688 INFO [train.py:996] (3/4) Epoch 6, batch 2250, loss[loss=0.1872, simple_loss=0.2554, pruned_loss=0.05951, over 21813.00 frames. ], tot_loss[loss=0.2236, simple_loss=0.2999, pruned_loss=0.07368, over 4272499.55 frames. ], batch size: 98, lr: 5.31e-03, grad_scale: 16.0 2023-06-23 18:53:11,218 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=928398.0, ans=0.125 2023-06-23 18:53:34,325 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=928458.0, ans=0.125 2023-06-23 18:54:36,781 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.53 vs. 
limit=10.0 2023-06-23 18:54:37,732 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=928578.0, ans=0.0 2023-06-23 18:54:40,516 INFO [train.py:996] (3/4) Epoch 6, batch 2300, loss[loss=0.2094, simple_loss=0.2773, pruned_loss=0.07078, over 21838.00 frames. ], tot_loss[loss=0.2215, simple_loss=0.2966, pruned_loss=0.0732, over 4268420.86 frames. ], batch size: 107, lr: 5.31e-03, grad_scale: 16.0 2023-06-23 18:55:28,457 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.991e+02 2.420e+02 2.816e+02 3.301e+02 5.962e+02, threshold=5.633e+02, percent-clipped=1.0 2023-06-23 18:55:29,079 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-23 18:55:29,084 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=928758.0, ans=0.0 2023-06-23 18:56:13,350 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=928878.0, ans=0.1 2023-06-23 18:56:38,344 INFO [train.py:996] (3/4) Epoch 6, batch 2350, loss[loss=0.1865, simple_loss=0.2529, pruned_loss=0.06007, over 21347.00 frames. ], tot_loss[loss=0.2207, simple_loss=0.294, pruned_loss=0.07365, over 4265230.29 frames. ], batch size: 211, lr: 5.31e-03, grad_scale: 16.0 2023-06-23 18:57:06,054 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=12.68 vs. limit=15.0 2023-06-23 18:57:14,820 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.89 vs. limit=10.0 2023-06-23 18:57:25,328 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.85 vs. limit=15.0 2023-06-23 18:58:01,957 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=929178.0, ans=0.05 2023-06-23 18:58:30,894 INFO [train.py:996] (3/4) Epoch 6, batch 2400, loss[loss=0.2792, simple_loss=0.3457, pruned_loss=0.1064, over 21592.00 frames. ], tot_loss[loss=0.2239, simple_loss=0.296, pruned_loss=0.07589, over 4270947.40 frames. ], batch size: 415, lr: 5.31e-03, grad_scale: 32.0 2023-06-23 18:59:21,203 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.092e+02 2.599e+02 2.851e+02 3.513e+02 5.978e+02, threshold=5.701e+02, percent-clipped=2.0 2023-06-23 18:59:22,116 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=929358.0, ans=0.125 2023-06-23 18:59:42,078 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=929418.0, ans=0.0 2023-06-23 19:00:22,695 INFO [train.py:996] (3/4) Epoch 6, batch 2450, loss[loss=0.2305, simple_loss=0.3545, pruned_loss=0.05325, over 20719.00 frames. ], tot_loss[loss=0.2279, simple_loss=0.2997, pruned_loss=0.07803, over 4273523.29 frames. 
], batch size: 608, lr: 5.30e-03, grad_scale: 16.0 2023-06-23 19:01:21,643 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=929718.0, ans=0.0 2023-06-23 19:02:01,109 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=929778.0, ans=0.0 2023-06-23 19:02:13,080 INFO [train.py:996] (3/4) Epoch 6, batch 2500, loss[loss=0.2286, simple_loss=0.3266, pruned_loss=0.06526, over 21164.00 frames. ], tot_loss[loss=0.2291, simple_loss=0.3008, pruned_loss=0.07873, over 4282340.25 frames. ], batch size: 143, lr: 5.30e-03, grad_scale: 16.0 2023-06-23 19:02:28,354 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.16 vs. limit=6.0 2023-06-23 19:02:33,789 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.77 vs. limit=15.0 2023-06-23 19:03:03,173 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.913e+02 2.544e+02 2.837e+02 3.478e+02 5.146e+02, threshold=5.674e+02, percent-clipped=0.0 2023-06-23 19:03:10,152 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=9.33 vs. limit=15.0 2023-06-23 19:03:40,641 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=930078.0, ans=0.0 2023-06-23 19:04:01,639 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=930078.0, ans=0.0 2023-06-23 19:04:03,506 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=930138.0, ans=0.125 2023-06-23 19:04:04,674 INFO [train.py:996] (3/4) Epoch 6, batch 2550, loss[loss=0.2234, simple_loss=0.3192, pruned_loss=0.06381, over 21720.00 frames. ], tot_loss[loss=0.2274, simple_loss=0.2997, pruned_loss=0.07758, over 4278383.65 frames. ], batch size: 247, lr: 5.30e-03, grad_scale: 16.0 2023-06-23 19:04:05,481 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=930138.0, ans=0.0 2023-06-23 19:04:08,998 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=930138.0, ans=0.125 2023-06-23 19:04:30,487 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=930198.0, ans=0.125 2023-06-23 19:04:32,619 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=930198.0, ans=0.125 2023-06-23 19:05:57,835 INFO [train.py:996] (3/4) Epoch 6, batch 2600, loss[loss=0.2506, simple_loss=0.3229, pruned_loss=0.08915, over 21930.00 frames. ], tot_loss[loss=0.2288, simple_loss=0.3019, pruned_loss=0.07786, over 4274614.04 frames. ], batch size: 372, lr: 5.30e-03, grad_scale: 16.0 2023-06-23 19:06:22,985 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=8.49 vs. 
limit=15.0 2023-06-23 19:06:30,940 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=930558.0, ans=0.035 2023-06-23 19:06:47,823 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.153e+02 2.627e+02 2.988e+02 3.634e+02 5.525e+02, threshold=5.976e+02, percent-clipped=0.0 2023-06-23 19:06:56,887 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=930618.0, ans=0.0 2023-06-23 19:07:16,444 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=930618.0, ans=0.0 2023-06-23 19:07:24,373 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=930678.0, ans=0.125 2023-06-23 19:07:26,185 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=930678.0, ans=0.1 2023-06-23 19:07:28,215 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=930678.0, ans=0.0 2023-06-23 19:07:29,739 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=930678.0, ans=0.125 2023-06-23 19:07:44,535 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=930678.0, ans=0.125 2023-06-23 19:07:49,109 INFO [train.py:996] (3/4) Epoch 6, batch 2650, loss[loss=0.2277, simple_loss=0.2984, pruned_loss=0.07848, over 21914.00 frames. ], tot_loss[loss=0.2296, simple_loss=0.3017, pruned_loss=0.07878, over 4283541.97 frames. ], batch size: 351, lr: 5.30e-03, grad_scale: 16.0 2023-06-23 19:08:01,476 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.09 vs. limit=6.0 2023-06-23 19:08:17,480 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=930798.0, ans=0.125 2023-06-23 19:08:40,895 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=930858.0, ans=0.0 2023-06-23 19:09:05,180 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.94 vs. limit=15.0 2023-06-23 19:09:06,810 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=9.59 vs. limit=15.0 2023-06-23 19:09:35,358 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=930978.0, ans=0.0 2023-06-23 19:09:42,221 INFO [train.py:996] (3/4) Epoch 6, batch 2700, loss[loss=0.2286, simple_loss=0.3005, pruned_loss=0.07831, over 21493.00 frames. ], tot_loss[loss=0.2286, simple_loss=0.2999, pruned_loss=0.07861, over 4282788.21 frames. 
], batch size: 131, lr: 5.30e-03, grad_scale: 16.0 2023-06-23 19:10:04,162 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=931098.0, ans=0.125 2023-06-23 19:10:07,593 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=931098.0, ans=0.125 2023-06-23 19:10:24,394 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=931158.0, ans=0.2 2023-06-23 19:10:32,949 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.187e+02 2.709e+02 3.074e+02 3.590e+02 5.374e+02, threshold=6.148e+02, percent-clipped=0.0 2023-06-23 19:11:05,273 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten.whitening_limit, batch_count=931218.0, ans=15.0 2023-06-23 19:11:34,533 INFO [train.py:996] (3/4) Epoch 6, batch 2750, loss[loss=0.2409, simple_loss=0.3191, pruned_loss=0.08138, over 21797.00 frames. ], tot_loss[loss=0.227, simple_loss=0.2985, pruned_loss=0.07772, over 4285353.44 frames. ], batch size: 124, lr: 5.30e-03, grad_scale: 16.0 2023-06-23 19:11:45,995 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=931338.0, ans=0.125 2023-06-23 19:12:11,424 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=4.45 vs. limit=15.0 2023-06-23 19:12:31,535 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.07 vs. limit=15.0 2023-06-23 19:13:24,218 INFO [train.py:996] (3/4) Epoch 6, batch 2800, loss[loss=0.2494, simple_loss=0.315, pruned_loss=0.09192, over 21218.00 frames. ], tot_loss[loss=0.2292, simple_loss=0.3026, pruned_loss=0.0779, over 4289572.24 frames. ], batch size: 176, lr: 5.30e-03, grad_scale: 32.0 2023-06-23 19:13:32,135 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=931638.0, ans=0.125 2023-06-23 19:13:44,544 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=10.47 vs. limit=15.0 2023-06-23 19:13:48,368 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=9.53 vs. limit=15.0 2023-06-23 19:14:22,172 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.082e+02 2.716e+02 3.036e+02 3.413e+02 5.034e+02, threshold=6.071e+02, percent-clipped=0.0 2023-06-23 19:14:42,984 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=931818.0, ans=0.1 2023-06-23 19:14:59,442 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.57 vs. limit=15.0 2023-06-23 19:15:05,706 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=931878.0, ans=0.1 2023-06-23 19:15:18,159 INFO [train.py:996] (3/4) Epoch 6, batch 2850, loss[loss=0.2568, simple_loss=0.3409, pruned_loss=0.08637, over 20916.00 frames. ], tot_loss[loss=0.234, simple_loss=0.3073, pruned_loss=0.08031, over 4285304.37 frames. 
], batch size: 607, lr: 5.30e-03, grad_scale: 32.0 2023-06-23 19:15:20,248 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=931938.0, ans=0.125 2023-06-23 19:15:42,496 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.54 vs. limit=15.0 2023-06-23 19:15:56,130 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=931998.0, ans=0.125 2023-06-23 19:15:58,148 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=932058.0, ans=0.125 2023-06-23 19:16:19,583 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=932058.0, ans=0.125 2023-06-23 19:16:47,209 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-23 19:16:47,788 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.56 vs. limit=15.0 2023-06-23 19:17:07,575 INFO [train.py:996] (3/4) Epoch 6, batch 2900, loss[loss=0.2559, simple_loss=0.3122, pruned_loss=0.09974, over 21757.00 frames. ], tot_loss[loss=0.2314, simple_loss=0.3037, pruned_loss=0.07955, over 4284895.25 frames. ], batch size: 473, lr: 5.30e-03, grad_scale: 32.0 2023-06-23 19:17:29,157 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.46 vs. limit=10.0 2023-06-23 19:17:47,953 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=932358.0, ans=0.1 2023-06-23 19:18:03,657 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.041e+02 2.630e+02 3.132e+02 3.824e+02 7.694e+02, threshold=6.265e+02, percent-clipped=2.0 2023-06-23 19:18:41,391 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.95 vs. limit=12.0 2023-06-23 19:18:44,491 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=932478.0, ans=0.125 2023-06-23 19:18:58,071 INFO [train.py:996] (3/4) Epoch 6, batch 2950, loss[loss=0.2194, simple_loss=0.3119, pruned_loss=0.06341, over 21663.00 frames. ], tot_loss[loss=0.2326, simple_loss=0.3053, pruned_loss=0.07993, over 4291006.20 frames. ], batch size: 230, lr: 5.30e-03, grad_scale: 32.0 2023-06-23 19:19:01,026 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.19 vs. limit=22.5 2023-06-23 19:19:27,142 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.03 vs. 
limit=6.0 2023-06-23 19:19:33,588 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=932598.0, ans=0.0 2023-06-23 19:20:00,678 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=932658.0, ans=0.05 2023-06-23 19:20:04,979 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.90 vs. limit=15.0 2023-06-23 19:20:27,780 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=932718.0, ans=0.125 2023-06-23 19:20:42,815 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.58 vs. limit=15.0 2023-06-23 19:20:50,753 INFO [train.py:996] (3/4) Epoch 6, batch 3000, loss[loss=0.2277, simple_loss=0.3089, pruned_loss=0.07325, over 21816.00 frames. ], tot_loss[loss=0.2351, simple_loss=0.3095, pruned_loss=0.08031, over 4292617.83 frames. ], batch size: 282, lr: 5.29e-03, grad_scale: 32.0 2023-06-23 19:20:50,753 INFO [train.py:1019] (3/4) Computing validation loss 2023-06-23 19:21:13,128 INFO [train.py:1028] (3/4) Epoch 6, validation: loss=0.2526, simple_loss=0.3435, pruned_loss=0.08085, over 1796401.00 frames. 2023-06-23 19:21:13,129 INFO [train.py:1029] (3/4) Maximum memory allocated so far is 23273MB 2023-06-23 19:21:50,201 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=932898.0, ans=0.2 2023-06-23 19:21:53,719 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=932898.0, ans=0.2 2023-06-23 19:22:09,827 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=932958.0, ans=0.035 2023-06-23 19:22:14,695 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.002e+02 2.535e+02 2.851e+02 3.436e+02 5.853e+02, threshold=5.702e+02, percent-clipped=0.0 2023-06-23 19:22:22,944 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=933018.0, ans=0.0 2023-06-23 19:22:26,691 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.39 vs. limit=22.5 2023-06-23 19:22:58,459 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=933078.0, ans=0.1 2023-06-23 19:23:05,128 INFO [train.py:996] (3/4) Epoch 6, batch 3050, loss[loss=0.1894, simple_loss=0.2797, pruned_loss=0.0495, over 21765.00 frames. ], tot_loss[loss=0.2327, simple_loss=0.3096, pruned_loss=0.07794, over 4291671.20 frames. ], batch size: 332, lr: 5.29e-03, grad_scale: 32.0 2023-06-23 19:23:33,696 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=933198.0, ans=0.125 2023-06-23 19:24:44,244 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.95 vs. limit=6.0 2023-06-23 19:24:53,916 INFO [train.py:996] (3/4) Epoch 6, batch 3100, loss[loss=0.2316, simple_loss=0.3248, pruned_loss=0.06915, over 21633.00 frames. 
], tot_loss[loss=0.2306, simple_loss=0.3076, pruned_loss=0.07678, over 4282004.20 frames. ], batch size: 389, lr: 5.29e-03, grad_scale: 32.0 2023-06-23 19:24:54,561 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=933438.0, ans=0.125 2023-06-23 19:25:55,820 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.838e+02 2.716e+02 3.164e+02 3.740e+02 6.470e+02, threshold=6.328e+02, percent-clipped=4.0 2023-06-23 19:26:04,206 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=933618.0, ans=0.1 2023-06-23 19:26:26,234 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=933618.0, ans=0.2 2023-06-23 19:26:52,674 INFO [train.py:996] (3/4) Epoch 6, batch 3150, loss[loss=0.2567, simple_loss=0.3314, pruned_loss=0.09099, over 21397.00 frames. ], tot_loss[loss=0.2329, simple_loss=0.3101, pruned_loss=0.07784, over 4280938.31 frames. ], batch size: 159, lr: 5.29e-03, grad_scale: 32.0 2023-06-23 19:27:41,268 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=933858.0, ans=0.125 2023-06-23 19:28:56,530 INFO [train.py:996] (3/4) Epoch 6, batch 3200, loss[loss=0.2041, simple_loss=0.3014, pruned_loss=0.05346, over 21739.00 frames. ], tot_loss[loss=0.2353, simple_loss=0.3133, pruned_loss=0.07862, over 4284669.54 frames. ], batch size: 351, lr: 5.29e-03, grad_scale: 32.0 2023-06-23 19:29:08,600 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.67 vs. limit=6.0 2023-06-23 19:29:16,434 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-23 19:29:46,296 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.879e+02 2.521e+02 2.818e+02 3.375e+02 4.819e+02, threshold=5.636e+02, percent-clipped=0.0 2023-06-23 19:30:11,016 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.45 vs. limit=6.0 2023-06-23 19:30:15,695 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=934218.0, ans=0.125 2023-06-23 19:30:40,280 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=934278.0, ans=0.0 2023-06-23 19:30:46,849 INFO [train.py:996] (3/4) Epoch 6, batch 3250, loss[loss=0.2256, simple_loss=0.29, pruned_loss=0.08058, over 21844.00 frames. ], tot_loss[loss=0.2361, simple_loss=0.3131, pruned_loss=0.07957, over 4281608.02 frames. ], batch size: 98, lr: 5.29e-03, grad_scale: 32.0 2023-06-23 19:30:56,425 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=934338.0, ans=0.1 2023-06-23 19:32:38,563 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=934578.0, ans=0.0 2023-06-23 19:32:41,465 INFO [train.py:996] (3/4) Epoch 6, batch 3300, loss[loss=0.2519, simple_loss=0.3438, pruned_loss=0.07998, over 21589.00 frames. ], tot_loss[loss=0.2319, simple_loss=0.3057, pruned_loss=0.07908, over 4274441.86 frames. 
], batch size: 441, lr: 5.29e-03, grad_scale: 32.0 2023-06-23 19:32:45,906 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=934638.0, ans=0.0 2023-06-23 19:33:38,656 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.006e+02 2.620e+02 2.941e+02 3.334e+02 7.153e+02, threshold=5.881e+02, percent-clipped=1.0 2023-06-23 19:33:46,386 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=934818.0, ans=0.125 2023-06-23 19:34:11,543 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.75 vs. limit=15.0 2023-06-23 19:34:32,231 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=934938.0, ans=0.2 2023-06-23 19:34:33,180 INFO [train.py:996] (3/4) Epoch 6, batch 3350, loss[loss=0.2447, simple_loss=0.3171, pruned_loss=0.08612, over 21381.00 frames. ], tot_loss[loss=0.2334, simple_loss=0.3087, pruned_loss=0.07909, over 4281433.43 frames. ], batch size: 131, lr: 5.29e-03, grad_scale: 32.0 2023-06-23 19:34:51,763 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=934998.0, ans=0.1 2023-06-23 19:34:53,937 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=934998.0, ans=0.125 2023-06-23 19:34:57,409 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=934998.0, ans=0.05 2023-06-23 19:35:18,072 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=935058.0, ans=0.0 2023-06-23 19:35:32,637 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=935058.0, ans=0.125 2023-06-23 19:36:09,097 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.02 vs. limit=10.0 2023-06-23 19:36:25,760 INFO [train.py:996] (3/4) Epoch 6, batch 3400, loss[loss=0.2039, simple_loss=0.2802, pruned_loss=0.06374, over 21364.00 frames. ], tot_loss[loss=0.2356, simple_loss=0.3095, pruned_loss=0.08081, over 4288688.20 frames. ], batch size: 144, lr: 5.29e-03, grad_scale: 16.0 2023-06-23 19:37:11,606 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=935358.0, ans=0.125 2023-06-23 19:37:30,208 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.050e+02 2.631e+02 2.892e+02 3.496e+02 6.427e+02, threshold=5.784e+02, percent-clipped=1.0 2023-06-23 19:37:48,858 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=935418.0, ans=0.0 2023-06-23 19:38:04,908 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=935478.0, ans=0.125 2023-06-23 19:38:05,479 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.87 vs. limit=15.0 2023-06-23 19:38:18,966 INFO [train.py:996] (3/4) Epoch 6, batch 3450, loss[loss=0.2209, simple_loss=0.2643, pruned_loss=0.08872, over 20136.00 frames. 
], tot_loss[loss=0.2312, simple_loss=0.3035, pruned_loss=0.07945, over 4277273.49 frames. ], batch size: 707, lr: 5.29e-03, grad_scale: 16.0 2023-06-23 19:40:16,021 INFO [train.py:996] (3/4) Epoch 6, batch 3500, loss[loss=0.2068, simple_loss=0.268, pruned_loss=0.07285, over 21275.00 frames. ], tot_loss[loss=0.2391, simple_loss=0.3117, pruned_loss=0.08326, over 4281996.77 frames. ], batch size: 608, lr: 5.29e-03, grad_scale: 16.0 2023-06-23 19:40:25,429 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=935838.0, ans=0.125 2023-06-23 19:40:48,010 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=935898.0, ans=0.07 2023-06-23 19:41:04,205 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=935898.0, ans=0.0 2023-06-23 19:41:16,941 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.217e+02 2.777e+02 3.098e+02 3.671e+02 6.397e+02, threshold=6.196e+02, percent-clipped=1.0 2023-06-23 19:41:27,056 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=6.79 vs. limit=22.5 2023-06-23 19:41:32,112 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=936018.0, ans=0.0 2023-06-23 19:41:48,541 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=6.48 vs. limit=15.0 2023-06-23 19:42:08,051 INFO [train.py:996] (3/4) Epoch 6, batch 3550, loss[loss=0.2257, simple_loss=0.2921, pruned_loss=0.07963, over 21392.00 frames. ], tot_loss[loss=0.2422, simple_loss=0.3151, pruned_loss=0.08459, over 4287444.93 frames. ], batch size: 194, lr: 5.29e-03, grad_scale: 16.0 2023-06-23 19:42:16,636 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=936138.0, ans=0.0 2023-06-23 19:42:54,994 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=936198.0, ans=0.125 2023-06-23 19:42:56,360 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=936198.0, ans=0.125 2023-06-23 19:43:06,821 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=936258.0, ans=0.2 2023-06-23 19:43:35,415 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.68 vs. limit=12.0 2023-06-23 19:43:45,918 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.37 vs. limit=22.5 2023-06-23 19:43:51,978 INFO [train.py:996] (3/4) Epoch 6, batch 3600, loss[loss=0.2283, simple_loss=0.2967, pruned_loss=0.07991, over 21668.00 frames. ], tot_loss[loss=0.2383, simple_loss=0.3097, pruned_loss=0.0835, over 4288636.45 frames. 
], batch size: 298, lr: 5.28e-03, grad_scale: 32.0 2023-06-23 19:44:18,119 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=936438.0, ans=0.0 2023-06-23 19:44:59,334 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=936558.0, ans=0.125 2023-06-23 19:45:00,200 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.031e+02 2.636e+02 3.056e+02 3.547e+02 6.528e+02, threshold=6.113e+02, percent-clipped=1.0 2023-06-23 19:45:32,246 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=936678.0, ans=0.125 2023-06-23 19:45:48,085 INFO [train.py:996] (3/4) Epoch 6, batch 3650, loss[loss=0.1975, simple_loss=0.2826, pruned_loss=0.05624, over 21608.00 frames. ], tot_loss[loss=0.2389, simple_loss=0.3104, pruned_loss=0.0837, over 4283984.64 frames. ], batch size: 230, lr: 5.28e-03, grad_scale: 32.0 2023-06-23 19:46:06,817 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=936738.0, ans=0.1 2023-06-23 19:46:20,752 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.87 vs. limit=15.0 2023-06-23 19:47:36,911 INFO [train.py:996] (3/4) Epoch 6, batch 3700, loss[loss=0.2182, simple_loss=0.2958, pruned_loss=0.07029, over 21839.00 frames. ], tot_loss[loss=0.2374, simple_loss=0.3098, pruned_loss=0.08249, over 4288343.51 frames. ], batch size: 332, lr: 5.28e-03, grad_scale: 32.0 2023-06-23 19:48:38,098 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.081e+02 2.573e+02 2.941e+02 3.537e+02 5.018e+02, threshold=5.882e+02, percent-clipped=0.0 2023-06-23 19:48:58,569 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=937218.0, ans=0.2 2023-06-23 19:49:00,607 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=14.18 vs. limit=15.0 2023-06-23 19:49:07,271 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=937278.0, ans=0.125 2023-06-23 19:49:27,055 INFO [train.py:996] (3/4) Epoch 6, batch 3750, loss[loss=0.193, simple_loss=0.2537, pruned_loss=0.06614, over 21277.00 frames. ], tot_loss[loss=0.2362, simple_loss=0.3081, pruned_loss=0.0821, over 4296162.13 frames. ], batch size: 549, lr: 5.28e-03, grad_scale: 32.0 2023-06-23 19:50:03,842 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.74 vs. limit=15.0 2023-06-23 19:50:29,920 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=937458.0, ans=0.125 2023-06-23 19:51:03,573 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=937578.0, ans=0.125 2023-06-23 19:51:29,602 INFO [train.py:996] (3/4) Epoch 6, batch 3800, loss[loss=0.2235, simple_loss=0.2995, pruned_loss=0.07373, over 21815.00 frames. ], tot_loss[loss=0.2331, simple_loss=0.3055, pruned_loss=0.08033, over 4294506.42 frames. 
], batch size: 247, lr: 5.28e-03, grad_scale: 16.0 2023-06-23 19:51:40,565 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=937638.0, ans=0.0 2023-06-23 19:51:54,235 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=937698.0, ans=0.025 2023-06-23 19:52:00,012 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=5.57 vs. limit=12.0 2023-06-23 19:52:21,708 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.562e+02 2.479e+02 2.831e+02 3.335e+02 6.491e+02, threshold=5.662e+02, percent-clipped=1.0 2023-06-23 19:52:26,185 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=937818.0, ans=0.125 2023-06-23 19:52:26,228 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=937818.0, ans=0.1 2023-06-23 19:53:20,131 INFO [train.py:996] (3/4) Epoch 6, batch 3850, loss[loss=0.2259, simple_loss=0.2896, pruned_loss=0.08113, over 21875.00 frames. ], tot_loss[loss=0.2334, simple_loss=0.3051, pruned_loss=0.08084, over 4279536.75 frames. ], batch size: 107, lr: 5.28e-03, grad_scale: 16.0 2023-06-23 19:53:32,628 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=937938.0, ans=0.125 2023-06-23 19:54:27,340 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=938118.0, ans=10.0 2023-06-23 19:54:51,315 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=938178.0, ans=0.0 2023-06-23 19:55:09,877 INFO [train.py:996] (3/4) Epoch 6, batch 3900, loss[loss=0.225, simple_loss=0.2867, pruned_loss=0.08158, over 21594.00 frames. ], tot_loss[loss=0.231, simple_loss=0.3008, pruned_loss=0.08063, over 4274441.79 frames. ], batch size: 212, lr: 5.28e-03, grad_scale: 16.0 2023-06-23 19:55:27,076 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.71 vs. limit=22.5 2023-06-23 19:55:30,132 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=938238.0, ans=0.125 2023-06-23 19:55:34,444 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.41 vs. limit=15.0 2023-06-23 19:56:02,984 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.187e+02 2.781e+02 3.101e+02 3.883e+02 8.958e+02, threshold=6.202e+02, percent-clipped=3.0 2023-06-23 19:56:28,363 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=938418.0, ans=0.0 2023-06-23 19:57:06,045 INFO [train.py:996] (3/4) Epoch 6, batch 3950, loss[loss=0.1651, simple_loss=0.2475, pruned_loss=0.04137, over 21479.00 frames. ], tot_loss[loss=0.2314, simple_loss=0.303, pruned_loss=0.0799, over 4274665.62 frames. 
], batch size: 212, lr: 5.28e-03, grad_scale: 16.0 2023-06-23 19:57:09,996 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=938538.0, ans=0.0 2023-06-23 19:57:44,932 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=938658.0, ans=0.025 2023-06-23 19:58:23,106 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.38 vs. limit=10.0 2023-06-23 19:58:56,503 INFO [train.py:996] (3/4) Epoch 6, batch 4000, loss[loss=0.2069, simple_loss=0.2703, pruned_loss=0.0718, over 21286.00 frames. ], tot_loss[loss=0.2236, simple_loss=0.2954, pruned_loss=0.07586, over 4275212.64 frames. ], batch size: 144, lr: 5.28e-03, grad_scale: 32.0 2023-06-23 19:59:15,107 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=938898.0, ans=0.0 2023-06-23 19:59:22,191 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=938898.0, ans=0.0 2023-06-23 19:59:34,635 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=938958.0, ans=0.1 2023-06-23 19:59:41,986 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=938958.0, ans=0.125 2023-06-23 19:59:44,410 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.908e+02 2.407e+02 2.711e+02 3.233e+02 5.039e+02, threshold=5.423e+02, percent-clipped=0.0 2023-06-23 20:00:23,907 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=939078.0, ans=0.0 2023-06-23 20:00:23,912 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=939078.0, ans=0.125 2023-06-23 20:00:47,435 INFO [train.py:996] (3/4) Epoch 6, batch 4050, loss[loss=0.2737, simple_loss=0.3332, pruned_loss=0.1071, over 21617.00 frames. ], tot_loss[loss=0.2209, simple_loss=0.2938, pruned_loss=0.07397, over 4278863.30 frames. ], batch size: 507, lr: 5.28e-03, grad_scale: 32.0 2023-06-23 20:01:12,270 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=939198.0, ans=0.2 2023-06-23 20:01:16,056 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=939198.0, ans=0.2 2023-06-23 20:01:57,824 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.85 vs. limit=12.0 2023-06-23 20:02:10,969 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=939378.0, ans=0.1 2023-06-23 20:02:32,564 INFO [train.py:996] (3/4) Epoch 6, batch 4100, loss[loss=0.1974, simple_loss=0.2758, pruned_loss=0.0595, over 21543.00 frames. ], tot_loss[loss=0.2222, simple_loss=0.2957, pruned_loss=0.07432, over 4279413.21 frames. 
], batch size: 212, lr: 5.28e-03, grad_scale: 32.0 2023-06-23 20:03:25,515 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=939558.0, ans=0.0 2023-06-23 20:03:26,707 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.792e+02 2.413e+02 2.658e+02 3.099e+02 5.779e+02, threshold=5.316e+02, percent-clipped=1.0 2023-06-23 20:04:18,589 INFO [train.py:996] (3/4) Epoch 6, batch 4150, loss[loss=0.2037, simple_loss=0.2936, pruned_loss=0.0569, over 21661.00 frames. ], tot_loss[loss=0.2206, simple_loss=0.2961, pruned_loss=0.07256, over 4279546.72 frames. ], batch size: 263, lr: 5.28e-03, grad_scale: 32.0 2023-06-23 20:04:32,022 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=939738.0, ans=0.125 2023-06-23 20:04:41,392 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.15 vs. limit=15.0 2023-06-23 20:05:10,596 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=939858.0, ans=0.125 2023-06-23 20:05:50,791 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=939918.0, ans=0.0 2023-06-23 20:06:01,668 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=939978.0, ans=0.125 2023-06-23 20:06:12,055 INFO [train.py:996] (3/4) Epoch 6, batch 4200, loss[loss=0.1988, simple_loss=0.2771, pruned_loss=0.06025, over 15633.00 frames. ], tot_loss[loss=0.2192, simple_loss=0.2958, pruned_loss=0.0713, over 4267790.77 frames. ], batch size: 61, lr: 5.27e-03, grad_scale: 32.0 2023-06-23 20:07:18,287 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.688e+02 2.286e+02 2.656e+02 3.507e+02 6.693e+02, threshold=5.313e+02, percent-clipped=3.0 2023-06-23 20:07:31,926 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.31 vs. limit=12.0 2023-06-23 20:07:42,454 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=940218.0, ans=0.125 2023-06-23 20:08:05,640 INFO [train.py:996] (3/4) Epoch 6, batch 4250, loss[loss=0.1973, simple_loss=0.2644, pruned_loss=0.06511, over 21224.00 frames. ], tot_loss[loss=0.225, simple_loss=0.3029, pruned_loss=0.07351, over 4270508.49 frames. 
], batch size: 176, lr: 5.27e-03, grad_scale: 16.0 2023-06-23 20:08:09,546 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=940338.0, ans=0.1 2023-06-23 20:08:39,820 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-23 20:08:41,735 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=940398.0, ans=0.2 2023-06-23 20:09:15,184 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=940458.0, ans=0.125 2023-06-23 20:09:27,903 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=940518.0, ans=0.125 2023-06-23 20:09:39,582 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=940578.0, ans=0.125 2023-06-23 20:09:59,053 INFO [train.py:996] (3/4) Epoch 6, batch 4300, loss[loss=0.2877, simple_loss=0.3792, pruned_loss=0.09806, over 21473.00 frames. ], tot_loss[loss=0.2312, simple_loss=0.3108, pruned_loss=0.07584, over 4269715.28 frames. ], batch size: 471, lr: 5.27e-03, grad_scale: 16.0 2023-06-23 20:10:28,272 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten.whitening_limit, batch_count=940698.0, ans=15.0 2023-06-23 20:10:53,892 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=940698.0, ans=0.0 2023-06-23 20:11:04,921 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten.whitening_limit, batch_count=940758.0, ans=15.0 2023-06-23 20:11:11,075 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.144e+02 2.724e+02 3.223e+02 4.213e+02 6.998e+02, threshold=6.446e+02, percent-clipped=6.0 2023-06-23 20:11:18,174 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=940818.0, ans=0.125 2023-06-23 20:11:45,034 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=940878.0, ans=0.0 2023-06-23 20:12:00,244 INFO [train.py:996] (3/4) Epoch 6, batch 4350, loss[loss=0.1872, simple_loss=0.2534, pruned_loss=0.06053, over 21543.00 frames. ], tot_loss[loss=0.2293, simple_loss=0.308, pruned_loss=0.07534, over 4259051.14 frames. ], batch size: 247, lr: 5.27e-03, grad_scale: 16.0 2023-06-23 20:12:40,840 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=940998.0, ans=0.1 2023-06-23 20:13:09,977 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.43 vs. limit=6.0 2023-06-23 20:13:51,992 INFO [train.py:996] (3/4) Epoch 6, batch 4400, loss[loss=0.2118, simple_loss=0.2984, pruned_loss=0.06262, over 21378.00 frames. ], tot_loss[loss=0.2268, simple_loss=0.3041, pruned_loss=0.07476, over 4259662.71 frames. 
], batch size: 176, lr: 5.27e-03, grad_scale: 32.0 2023-06-23 20:14:01,745 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=941238.0, ans=0.1 2023-06-23 20:14:53,316 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.950e+02 2.531e+02 2.865e+02 3.462e+02 7.210e+02, threshold=5.730e+02, percent-clipped=2.0 2023-06-23 20:15:30,603 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=12.11 vs. limit=22.5 2023-06-23 20:15:35,241 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=941478.0, ans=0.125 2023-06-23 20:15:43,302 INFO [train.py:996] (3/4) Epoch 6, batch 4450, loss[loss=0.2246, simple_loss=0.3132, pruned_loss=0.06798, over 21570.00 frames. ], tot_loss[loss=0.2318, simple_loss=0.3111, pruned_loss=0.07626, over 4261494.49 frames. ], batch size: 230, lr: 5.27e-03, grad_scale: 32.0 2023-06-23 20:16:01,106 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=941538.0, ans=0.125 2023-06-23 20:16:15,056 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=941598.0, ans=0.125 2023-06-23 20:17:38,974 INFO [train.py:996] (3/4) Epoch 6, batch 4500, loss[loss=0.2343, simple_loss=0.3298, pruned_loss=0.06938, over 20116.00 frames. ], tot_loss[loss=0.2338, simple_loss=0.3122, pruned_loss=0.07769, over 4275785.35 frames. ], batch size: 702, lr: 5.27e-03, grad_scale: 32.0 2023-06-23 20:18:03,360 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=941898.0, ans=0.125 2023-06-23 20:18:30,148 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.49 vs. limit=22.5 2023-06-23 20:18:32,698 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.004e+02 2.439e+02 2.793e+02 3.421e+02 5.110e+02, threshold=5.586e+02, percent-clipped=0.0 2023-06-23 20:18:57,682 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=942018.0, ans=0.0 2023-06-23 20:19:02,854 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=942018.0, ans=0.1 2023-06-23 20:19:33,734 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=942138.0, ans=0.2 2023-06-23 20:19:34,555 INFO [train.py:996] (3/4) Epoch 6, batch 4550, loss[loss=0.2712, simple_loss=0.342, pruned_loss=0.1001, over 21329.00 frames. ], tot_loss[loss=0.2367, simple_loss=0.3154, pruned_loss=0.07898, over 4279502.61 frames. ], batch size: 548, lr: 5.27e-03, grad_scale: 32.0 2023-06-23 20:19:48,574 INFO [scaling.py:962] (3/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=4.28 vs. limit=5.0 2023-06-23 20:19:54,956 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=942198.0, ans=0.1 2023-06-23 20:20:32,374 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.14 vs. 
limit=6.0 2023-06-23 20:21:25,288 INFO [train.py:996] (3/4) Epoch 6, batch 4600, loss[loss=0.2318, simple_loss=0.3176, pruned_loss=0.073, over 21650.00 frames. ], tot_loss[loss=0.2399, simple_loss=0.3184, pruned_loss=0.0807, over 4283643.88 frames. ], batch size: 389, lr: 5.27e-03, grad_scale: 32.0 2023-06-23 20:21:25,810 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=942438.0, ans=0.125 2023-06-23 20:21:52,328 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=942498.0, ans=0.125 2023-06-23 20:22:16,060 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.60 vs. limit=15.0 2023-06-23 20:22:25,587 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.170e+02 2.585e+02 3.169e+02 3.580e+02 7.815e+02, threshold=6.337e+02, percent-clipped=3.0 2023-06-23 20:22:29,817 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=942618.0, ans=0.125 2023-06-23 20:23:03,887 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=942678.0, ans=0.0 2023-06-23 20:23:13,749 INFO [train.py:996] (3/4) Epoch 6, batch 4650, loss[loss=0.1541, simple_loss=0.2326, pruned_loss=0.03784, over 21316.00 frames. ], tot_loss[loss=0.2345, simple_loss=0.3113, pruned_loss=0.07882, over 4292096.13 frames. ], batch size: 176, lr: 5.27e-03, grad_scale: 32.0 2023-06-23 20:23:21,393 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=942738.0, ans=0.0 2023-06-23 20:23:23,351 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=942738.0, ans=0.2 2023-06-23 20:23:37,518 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-23 20:24:04,089 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=942858.0, ans=0.2 2023-06-23 20:24:23,828 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=7.24 vs. limit=10.0 2023-06-23 20:25:03,300 INFO [train.py:996] (3/4) Epoch 6, batch 4700, loss[loss=0.2257, simple_loss=0.2714, pruned_loss=0.08995, over 20085.00 frames. ], tot_loss[loss=0.2265, simple_loss=0.3008, pruned_loss=0.07609, over 4280941.80 frames. ], batch size: 707, lr: 5.27e-03, grad_scale: 16.0 2023-06-23 20:25:15,973 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=943038.0, ans=0.2 2023-06-23 20:26:04,152 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.777e+02 2.385e+02 2.698e+02 3.095e+02 5.090e+02, threshold=5.395e+02, percent-clipped=0.0 2023-06-23 20:26:34,403 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=13.58 vs. limit=22.5 2023-06-23 20:26:50,577 INFO [train.py:996] (3/4) Epoch 6, batch 4750, loss[loss=0.2314, simple_loss=0.2984, pruned_loss=0.08221, over 22049.00 frames. ], tot_loss[loss=0.2236, simple_loss=0.2951, pruned_loss=0.07607, over 4282824.72 frames. 
], batch size: 119, lr: 5.27e-03, grad_scale: 16.0 2023-06-23 20:26:59,530 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=943338.0, ans=0.07 2023-06-23 20:28:38,481 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=943638.0, ans=0.2 2023-06-23 20:28:39,497 INFO [train.py:996] (3/4) Epoch 6, batch 4800, loss[loss=0.2113, simple_loss=0.2925, pruned_loss=0.06509, over 21744.00 frames. ], tot_loss[loss=0.2248, simple_loss=0.2964, pruned_loss=0.07662, over 4293027.71 frames. ], batch size: 247, lr: 5.26e-03, grad_scale: 32.0 2023-06-23 20:28:54,133 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=943638.0, ans=0.0 2023-06-23 20:29:21,217 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=943698.0, ans=0.2 2023-06-23 20:29:42,941 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.177e+02 2.734e+02 3.125e+02 3.511e+02 5.007e+02, threshold=6.249e+02, percent-clipped=0.0 2023-06-23 20:29:55,538 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=943818.0, ans=0.5 2023-06-23 20:30:12,464 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=943878.0, ans=0.0 2023-06-23 20:30:27,070 INFO [train.py:996] (3/4) Epoch 6, batch 4850, loss[loss=0.2265, simple_loss=0.2926, pruned_loss=0.08016, over 21674.00 frames. ], tot_loss[loss=0.2251, simple_loss=0.2974, pruned_loss=0.0764, over 4295689.08 frames. ], batch size: 230, lr: 5.26e-03, grad_scale: 16.0 2023-06-23 20:30:33,066 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=943938.0, ans=0.0 2023-06-23 20:30:36,363 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=943938.0, ans=0.035 2023-06-23 20:30:40,010 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=943938.0, ans=0.0 2023-06-23 20:30:43,177 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=943998.0, ans=0.1 2023-06-23 20:31:07,714 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=943998.0, ans=0.0 2023-06-23 20:31:13,105 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=944058.0, ans=0.125 2023-06-23 20:31:33,064 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=3.75 vs. limit=12.0 2023-06-23 20:31:34,406 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=944118.0, ans=0.2 2023-06-23 20:32:17,551 INFO [train.py:996] (3/4) Epoch 6, batch 4900, loss[loss=0.2489, simple_loss=0.3268, pruned_loss=0.08552, over 21292.00 frames. ], tot_loss[loss=0.2279, simple_loss=0.3002, pruned_loss=0.07776, over 4299962.65 frames. 
], batch size: 159, lr: 5.26e-03, grad_scale: 16.0 2023-06-23 20:32:18,148 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=944238.0, ans=0.0 2023-06-23 20:32:42,349 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=944298.0, ans=0.125 2023-06-23 20:32:45,983 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=944298.0, ans=0.07 2023-06-23 20:33:01,751 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=944298.0, ans=0.2 2023-06-23 20:33:11,426 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=14.55 vs. limit=22.5 2023-06-23 20:33:28,629 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.946e+02 2.472e+02 2.764e+02 3.016e+02 5.453e+02, threshold=5.528e+02, percent-clipped=0.0 2023-06-23 20:33:50,533 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=944478.0, ans=0.0 2023-06-23 20:34:05,822 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=7.75 vs. limit=15.0 2023-06-23 20:34:09,896 INFO [train.py:996] (3/4) Epoch 6, batch 4950, loss[loss=0.2067, simple_loss=0.3066, pruned_loss=0.05345, over 21636.00 frames. ], tot_loss[loss=0.2288, simple_loss=0.3054, pruned_loss=0.07616, over 4295822.47 frames. ], batch size: 414, lr: 5.26e-03, grad_scale: 16.0 2023-06-23 20:35:07,238 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=944658.0, ans=0.125 2023-06-23 20:35:34,408 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=944718.0, ans=0.95 2023-06-23 20:35:47,125 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=944778.0, ans=0.0 2023-06-23 20:35:58,233 INFO [train.py:996] (3/4) Epoch 6, batch 5000, loss[loss=0.2383, simple_loss=0.3151, pruned_loss=0.08074, over 21753.00 frames. ], tot_loss[loss=0.2248, simple_loss=0.3038, pruned_loss=0.0729, over 4290675.73 frames. ], batch size: 112, lr: 5.26e-03, grad_scale: 16.0 2023-06-23 20:36:04,312 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.57 vs. limit=15.0 2023-06-23 20:37:01,261 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.831e+02 2.469e+02 2.951e+02 3.464e+02 5.172e+02, threshold=5.903e+02, percent-clipped=0.0 2023-06-23 20:37:16,901 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-23 20:37:37,411 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=945078.0, ans=0.0 2023-06-23 20:37:40,283 INFO [train.py:996] (3/4) Epoch 6, batch 5050, loss[loss=0.2178, simple_loss=0.2916, pruned_loss=0.07206, over 21433.00 frames. ], tot_loss[loss=0.2266, simple_loss=0.304, pruned_loss=0.07459, over 4289094.60 frames. 
], batch size: 194, lr: 5.26e-03, grad_scale: 16.0 2023-06-23 20:38:00,307 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.58 vs. limit=15.0 2023-06-23 20:38:01,273 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=945198.0, ans=0.125 2023-06-23 20:38:24,767 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.33 vs. limit=22.5 2023-06-23 20:39:26,499 INFO [train.py:996] (3/4) Epoch 6, batch 5100, loss[loss=0.2537, simple_loss=0.3212, pruned_loss=0.09313, over 21776.00 frames. ], tot_loss[loss=0.226, simple_loss=0.3021, pruned_loss=0.0749, over 4294248.79 frames. ], batch size: 112, lr: 5.26e-03, grad_scale: 16.0 2023-06-23 20:39:48,723 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.85 vs. limit=10.0 2023-06-23 20:40:06,490 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.46 vs. limit=15.0 2023-06-23 20:40:06,608 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.99 vs. limit=6.0 2023-06-23 20:40:30,250 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.074e+02 2.802e+02 3.209e+02 3.785e+02 5.711e+02, threshold=6.418e+02, percent-clipped=0.0 2023-06-23 20:40:32,451 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=945618.0, ans=0.0 2023-06-23 20:40:44,912 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=945618.0, ans=0.125 2023-06-23 20:41:07,797 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=945678.0, ans=0.0 2023-06-23 20:41:15,793 INFO [train.py:996] (3/4) Epoch 6, batch 5150, loss[loss=0.2079, simple_loss=0.2815, pruned_loss=0.06713, over 21729.00 frames. ], tot_loss[loss=0.2245, simple_loss=0.2992, pruned_loss=0.07495, over 4292974.74 frames. ], batch size: 247, lr: 5.26e-03, grad_scale: 16.0 2023-06-23 20:41:33,884 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=945738.0, ans=0.0 2023-06-23 20:41:50,351 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=945798.0, ans=0.125 2023-06-23 20:41:55,559 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=945798.0, ans=0.125 2023-06-23 20:41:58,108 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.84 vs. limit=12.0 2023-06-23 20:42:44,916 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=945918.0, ans=0.125 2023-06-23 20:43:05,737 INFO [train.py:996] (3/4) Epoch 6, batch 5200, loss[loss=0.2662, simple_loss=0.3652, pruned_loss=0.08357, over 21208.00 frames. ], tot_loss[loss=0.2272, simple_loss=0.3022, pruned_loss=0.07611, over 4289657.82 frames. 
], batch size: 548, lr: 5.26e-03, grad_scale: 32.0 2023-06-23 20:43:33,587 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-23 20:43:37,340 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten.whitening_limit, batch_count=946098.0, ans=15.0 2023-06-23 20:44:04,633 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=946158.0, ans=0.125 2023-06-23 20:44:14,598 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.120e+02 2.657e+02 3.031e+02 3.767e+02 5.750e+02, threshold=6.062e+02, percent-clipped=0.0 2023-06-23 20:44:42,085 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=946278.0, ans=0.1 2023-06-23 20:44:49,349 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=946278.0, ans=10.0 2023-06-23 20:44:59,545 INFO [train.py:996] (3/4) Epoch 6, batch 5250, loss[loss=0.2057, simple_loss=0.2948, pruned_loss=0.05832, over 21408.00 frames. ], tot_loss[loss=0.2289, simple_loss=0.3071, pruned_loss=0.07533, over 4292012.92 frames. ], batch size: 211, lr: 5.26e-03, grad_scale: 32.0 2023-06-23 20:46:22,742 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.53 vs. limit=15.0 2023-06-23 20:46:52,847 INFO [train.py:996] (3/4) Epoch 6, batch 5300, loss[loss=0.2178, simple_loss=0.288, pruned_loss=0.07384, over 21894.00 frames. ], tot_loss[loss=0.2279, simple_loss=0.3057, pruned_loss=0.07501, over 4296419.34 frames. ], batch size: 107, lr: 5.26e-03, grad_scale: 32.0 2023-06-23 20:47:07,347 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=946638.0, ans=0.125 2023-06-23 20:47:21,263 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=946698.0, ans=0.1 2023-06-23 20:47:37,482 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=12.97 vs. limit=22.5 2023-06-23 20:47:40,559 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.21 vs. limit=22.5 2023-06-23 20:47:40,617 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.70 vs. limit=15.0 2023-06-23 20:47:47,136 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=946758.0, ans=0.125 2023-06-23 20:47:55,259 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.995e+02 2.539e+02 2.781e+02 3.236e+02 4.836e+02, threshold=5.563e+02, percent-clipped=0.0 2023-06-23 20:48:41,803 INFO [train.py:996] (3/4) Epoch 6, batch 5350, loss[loss=0.2246, simple_loss=0.3005, pruned_loss=0.07435, over 21723.00 frames. ], tot_loss[loss=0.2291, simple_loss=0.3048, pruned_loss=0.0767, over 4300461.09 frames. 
], batch size: 112, lr: 5.26e-03, grad_scale: 32.0 2023-06-23 20:48:44,139 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=946938.0, ans=0.2 2023-06-23 20:48:49,392 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=946938.0, ans=0.1 2023-06-23 20:49:15,787 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.42 vs. limit=15.0 2023-06-23 20:50:29,945 INFO [train.py:996] (3/4) Epoch 6, batch 5400, loss[loss=0.2075, simple_loss=0.2912, pruned_loss=0.06191, over 21684.00 frames. ], tot_loss[loss=0.2291, simple_loss=0.3034, pruned_loss=0.07737, over 4289295.27 frames. ], batch size: 389, lr: 5.25e-03, grad_scale: 16.0 2023-06-23 20:50:43,906 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=4.80 vs. limit=15.0 2023-06-23 20:50:50,336 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=12.59 vs. limit=22.5 2023-06-23 20:51:05,735 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=947298.0, ans=0.1 2023-06-23 20:51:16,202 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=947358.0, ans=0.125 2023-06-23 20:51:21,430 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=947358.0, ans=0.0 2023-06-23 20:51:34,443 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.034e+02 2.654e+02 3.257e+02 3.898e+02 6.722e+02, threshold=6.513e+02, percent-clipped=2.0 2023-06-23 20:51:45,879 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=947418.0, ans=0.2 2023-06-23 20:52:19,512 INFO [train.py:996] (3/4) Epoch 6, batch 5450, loss[loss=0.2024, simple_loss=0.2913, pruned_loss=0.05678, over 21662.00 frames. ], tot_loss[loss=0.2273, simple_loss=0.3031, pruned_loss=0.07579, over 4295297.72 frames. ], batch size: 263, lr: 5.25e-03, grad_scale: 16.0 2023-06-23 20:54:06,160 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=947778.0, ans=0.125 2023-06-23 20:54:09,231 INFO [train.py:996] (3/4) Epoch 6, batch 5500, loss[loss=0.1966, simple_loss=0.2915, pruned_loss=0.05086, over 21435.00 frames. ], tot_loss[loss=0.2275, simple_loss=0.3079, pruned_loss=0.07349, over 4293937.35 frames. ], batch size: 211, lr: 5.25e-03, grad_scale: 16.0 2023-06-23 20:54:30,809 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=947898.0, ans=0.125 2023-06-23 20:54:39,780 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.96 vs. limit=22.5 2023-06-23 20:55:24,832 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.705e+02 2.255e+02 2.654e+02 3.007e+02 4.668e+02, threshold=5.308e+02, percent-clipped=0.0 2023-06-23 20:55:41,822 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.87 vs. 
limit=6.0 2023-06-23 20:55:45,573 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=14.26 vs. limit=22.5 2023-06-23 20:56:04,054 INFO [train.py:996] (3/4) Epoch 6, batch 5550, loss[loss=0.2717, simple_loss=0.3614, pruned_loss=0.091, over 21457.00 frames. ], tot_loss[loss=0.2254, simple_loss=0.3085, pruned_loss=0.07118, over 4291463.39 frames. ], batch size: 507, lr: 5.25e-03, grad_scale: 16.0 2023-06-23 20:56:28,194 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=948198.0, ans=0.1 2023-06-23 20:56:28,238 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=948198.0, ans=0.125 2023-06-23 20:57:05,027 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=948258.0, ans=0.1 2023-06-23 20:57:19,763 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=948318.0, ans=0.2 2023-06-23 20:57:46,752 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=5.83 vs. limit=15.0 2023-06-23 20:57:56,420 INFO [train.py:996] (3/4) Epoch 6, batch 5600, loss[loss=0.1977, simple_loss=0.3003, pruned_loss=0.04753, over 21200.00 frames. ], tot_loss[loss=0.2249, simple_loss=0.31, pruned_loss=0.06994, over 4287655.48 frames. ], batch size: 548, lr: 5.25e-03, grad_scale: 32.0 2023-06-23 20:58:13,237 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=948498.0, ans=0.125 2023-06-23 20:58:32,817 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=948498.0, ans=0.125 2023-06-23 20:59:00,067 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=948618.0, ans=0.0 2023-06-23 20:59:01,212 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.636e+02 2.332e+02 2.800e+02 3.364e+02 5.770e+02, threshold=5.601e+02, percent-clipped=3.0 2023-06-23 20:59:44,385 INFO [train.py:996] (3/4) Epoch 6, batch 5650, loss[loss=0.2847, simple_loss=0.3774, pruned_loss=0.09603, over 21285.00 frames. ], tot_loss[loss=0.2283, simple_loss=0.3136, pruned_loss=0.07148, over 4286489.22 frames. ], batch size: 548, lr: 5.25e-03, grad_scale: 32.0 2023-06-23 21:01:29,434 INFO [train.py:996] (3/4) Epoch 6, batch 5700, loss[loss=0.1942, simple_loss=0.2814, pruned_loss=0.05349, over 21633.00 frames. ], tot_loss[loss=0.2291, simple_loss=0.3124, pruned_loss=0.07288, over 4281030.42 frames. 
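Annotation: the scaling.py:182 ScheduledFloat records show regularisation constants (dropout probabilities, skip rates, balancer targets, bypass scales) being looked up as a function of the global batch_count. A minimal sketch of such a schedule, assuming simple piecewise-linear interpolation between (batch_count, value) breakpoints; the class name and the breakpoints in the example are illustrative only.

    from bisect import bisect_right

    class PiecewiseLinearSchedule:
        """Return a float that changes with the global batch count."""

        def __init__(self, points):
            # points: list of (batch_count, value) breakpoints.
            self.points = sorted(points)

        def value(self, batch_count: float) -> float:
            xs = [x for x, _ in self.points]
            i = bisect_right(xs, batch_count)
            if i == 0:
                return self.points[0][1]
            if i == len(self.points):
                return self.points[-1][1]
            (x0, y0), (x1, y1) = self.points[i - 1], self.points[i]
            frac = (batch_count - x0) / (x1 - x0)
            return y0 + frac * (y1 - y0)

    # e.g. a skip-rate that decays from 0.5 to 0.0 over the first 20k batches:
    skip_rate = PiecewiseLinearSchedule([(0.0, 0.5), (20000.0, 0.0)])
    print(skip_rate.value(948198.0))  # 0.0 once past the last breakpoint
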
], batch size: 230, lr: 5.25e-03, grad_scale: 32.0 2023-06-23 21:02:21,069 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=949158.0, ans=0.125 2023-06-23 21:02:40,700 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=949218.0, ans=0.1 2023-06-23 21:02:41,755 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.056e+02 2.515e+02 2.975e+02 3.453e+02 5.794e+02, threshold=5.950e+02, percent-clipped=1.0 2023-06-23 21:03:12,822 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=949278.0, ans=0.1 2023-06-23 21:03:15,059 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.40 vs. limit=15.0 2023-06-23 21:03:31,977 INFO [train.py:996] (3/4) Epoch 6, batch 5750, loss[loss=0.1814, simple_loss=0.2796, pruned_loss=0.04158, over 21741.00 frames. ], tot_loss[loss=0.2229, simple_loss=0.3054, pruned_loss=0.07015, over 4281899.18 frames. ], batch size: 332, lr: 5.25e-03, grad_scale: 32.0 2023-06-23 21:04:55,766 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=949578.0, ans=0.07 2023-06-23 21:05:22,447 INFO [train.py:996] (3/4) Epoch 6, batch 5800, loss[loss=0.2137, simple_loss=0.3099, pruned_loss=0.05877, over 21662.00 frames. ], tot_loss[loss=0.2205, simple_loss=0.3037, pruned_loss=0.06863, over 4276965.60 frames. ], batch size: 230, lr: 5.25e-03, grad_scale: 32.0 2023-06-23 21:06:20,986 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=949758.0, ans=0.125 2023-06-23 21:06:27,699 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.707e+02 2.304e+02 2.799e+02 4.068e+02 6.558e+02, threshold=5.598e+02, percent-clipped=2.0 2023-06-23 21:06:29,918 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=949818.0, ans=0.125 2023-06-23 21:07:12,462 INFO [train.py:996] (3/4) Epoch 6, batch 5850, loss[loss=0.1987, simple_loss=0.3211, pruned_loss=0.03813, over 21168.00 frames. ], tot_loss[loss=0.2161, simple_loss=0.3024, pruned_loss=0.06496, over 4276678.63 frames. ], batch size: 548, lr: 5.25e-03, grad_scale: 32.0 2023-06-23 21:07:43,782 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=10.50 vs. limit=22.5 2023-06-23 21:08:45,763 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=950178.0, ans=0.125 2023-06-23 21:08:55,270 INFO [train.py:996] (3/4) Epoch 6, batch 5900, loss[loss=0.2194, simple_loss=0.2944, pruned_loss=0.07221, over 21594.00 frames. ], tot_loss[loss=0.2076, simple_loss=0.2958, pruned_loss=0.05966, over 4277051.85 frames. ], batch size: 471, lr: 5.25e-03, grad_scale: 32.0 2023-06-23 21:09:01,163 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=950238.0, ans=0.125 2023-06-23 21:09:20,258 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=6.25 vs. 
limit=15.0 2023-06-23 21:09:55,794 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=6.02 vs. limit=22.5 2023-06-23 21:09:57,802 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.381e+02 1.988e+02 2.407e+02 3.041e+02 4.833e+02, threshold=4.814e+02, percent-clipped=0.0 2023-06-23 21:10:41,834 INFO [train.py:996] (3/4) Epoch 6, batch 5950, loss[loss=0.1929, simple_loss=0.2626, pruned_loss=0.06157, over 22005.00 frames. ], tot_loss[loss=0.2108, simple_loss=0.295, pruned_loss=0.0633, over 4285320.77 frames. ], batch size: 103, lr: 5.25e-03, grad_scale: 32.0 2023-06-23 21:12:23,279 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=950778.0, ans=0.1 2023-06-23 21:12:30,035 INFO [train.py:996] (3/4) Epoch 6, batch 6000, loss[loss=0.1904, simple_loss=0.2567, pruned_loss=0.06205, over 21655.00 frames. ], tot_loss[loss=0.2131, simple_loss=0.2914, pruned_loss=0.06734, over 4275735.17 frames. ], batch size: 264, lr: 5.24e-03, grad_scale: 32.0 2023-06-23 21:12:30,035 INFO [train.py:1019] (3/4) Computing validation loss 2023-06-23 21:12:53,047 INFO [train.py:1028] (3/4) Epoch 6, validation: loss=0.2596, simple_loss=0.3528, pruned_loss=0.08322, over 1796401.00 frames. 2023-06-23 21:12:53,048 INFO [train.py:1029] (3/4) Maximum memory allocated so far is 23273MB 2023-06-23 21:13:09,548 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=950838.0, ans=0.125 2023-06-23 21:13:32,137 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=950958.0, ans=0.125 2023-06-23 21:14:03,937 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.015e+02 2.620e+02 2.865e+02 3.269e+02 5.211e+02, threshold=5.729e+02, percent-clipped=1.0 2023-06-23 21:14:20,769 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=951078.0, ans=0.2 2023-06-23 21:14:48,477 INFO [train.py:996] (3/4) Epoch 6, batch 6050, loss[loss=0.1689, simple_loss=0.2387, pruned_loss=0.04951, over 21421.00 frames. ], tot_loss[loss=0.2115, simple_loss=0.2867, pruned_loss=0.06815, over 4264274.22 frames. ], batch size: 195, lr: 5.24e-03, grad_scale: 32.0 2023-06-23 21:14:55,832 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=951138.0, ans=0.125 2023-06-23 21:15:02,966 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=951138.0, ans=0.0 2023-06-23 21:15:42,887 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=951258.0, ans=0.2 2023-06-23 21:15:56,297 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=951318.0, ans=0.2 2023-06-23 21:16:27,774 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=951378.0, ans=0.2 2023-06-23 21:16:30,419 INFO [train.py:996] (3/4) Epoch 6, batch 6100, loss[loss=0.2454, simple_loss=0.3208, pruned_loss=0.085, over 21912.00 frames. ], tot_loss[loss=0.2095, simple_loss=0.2849, pruned_loss=0.06706, over 4267644.74 frames. 
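Annotation: at batch 6000 above, the log switches to "Computing validation loss", prints a validation summary over 1796401.00 frames, and then reports the maximum CUDA memory allocated so far. A sketch of that periodic validation step, assuming a plain PyTorch evaluation loop; model, valid_dl, compute_loss, and the interval argument are placeholders, not the actual train.py objects.

    import logging
    import torch

    def maybe_validate(model, valid_dl, compute_loss, batch_idx: int,
                       valid_interval: int, device: str = "cuda"):
        if batch_idx % valid_interval != 0:
            return
        logging.info("Computing validation loss")
        model.eval()
        tot_loss, tot_frames = 0.0, 0.0
        with torch.no_grad():
            for batch in valid_dl:
                loss, num_frames = compute_loss(model, batch)
                tot_loss += loss.item() * num_frames
                tot_frames += num_frames
        model.train()
        logging.info(f"validation: loss={tot_loss / tot_frames:.4f}, "
                     f"over {tot_frames:.2f} frames.")
        logging.info("Maximum memory allocated so far is "
                     f"{torch.cuda.max_memory_allocated(device) // 2**20}MB")
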
], batch size: 124, lr: 5.24e-03, grad_scale: 32.0 2023-06-23 21:16:38,936 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.46 vs. limit=15.0 2023-06-23 21:16:43,425 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=951438.0, ans=0.0 2023-06-23 21:17:40,846 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.589e+02 2.204e+02 2.422e+02 2.717e+02 3.811e+02, threshold=4.844e+02, percent-clipped=0.0 2023-06-23 21:18:13,214 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=6.98 vs. limit=15.0 2023-06-23 21:18:18,509 INFO [train.py:996] (3/4) Epoch 6, batch 6150, loss[loss=0.2098, simple_loss=0.2869, pruned_loss=0.06637, over 21535.00 frames. ], tot_loss[loss=0.2135, simple_loss=0.288, pruned_loss=0.0695, over 4273071.33 frames. ], batch size: 389, lr: 5.24e-03, grad_scale: 32.0 2023-06-23 21:18:23,077 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=3.60 vs. limit=12.0 2023-06-23 21:18:49,045 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=951798.0, ans=0.125 2023-06-23 21:20:08,100 INFO [train.py:996] (3/4) Epoch 6, batch 6200, loss[loss=0.2292, simple_loss=0.3185, pruned_loss=0.06996, over 21710.00 frames. ], tot_loss[loss=0.2164, simple_loss=0.2929, pruned_loss=0.06993, over 4278833.22 frames. ], batch size: 414, lr: 5.24e-03, grad_scale: 16.0 2023-06-23 21:20:40,820 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=952098.0, ans=0.2 2023-06-23 21:21:15,537 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.781e+02 2.446e+02 2.781e+02 3.201e+02 6.151e+02, threshold=5.562e+02, percent-clipped=2.0 2023-06-23 21:21:36,607 INFO [scaling.py:962] (3/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=4.29 vs. limit=5.0 2023-06-23 21:21:46,258 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=952278.0, ans=0.125 2023-06-23 21:21:58,208 INFO [train.py:996] (3/4) Epoch 6, batch 6250, loss[loss=0.2203, simple_loss=0.2747, pruned_loss=0.08289, over 20263.00 frames. ], tot_loss[loss=0.2186, simple_loss=0.2978, pruned_loss=0.06973, over 4276318.71 frames. ], batch size: 702, lr: 5.24e-03, grad_scale: 16.0 2023-06-23 21:22:35,842 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-23 21:22:53,998 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=952458.0, ans=0.0 2023-06-23 21:23:02,083 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=6.15 vs. limit=15.0 2023-06-23 21:23:08,185 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.max_positive, batch_count=952518.0, ans=0.95 2023-06-23 21:23:43,155 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.55 vs. 
limit=22.5 2023-06-23 21:23:45,352 INFO [train.py:996] (3/4) Epoch 6, batch 6300, loss[loss=0.2191, simple_loss=0.2883, pruned_loss=0.07488, over 21772.00 frames. ], tot_loss[loss=0.2196, simple_loss=0.301, pruned_loss=0.06913, over 4279463.98 frames. ], batch size: 247, lr: 5.24e-03, grad_scale: 16.0 2023-06-23 21:23:45,858 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=952638.0, ans=0.1 2023-06-23 21:23:47,875 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=952638.0, ans=0.125 2023-06-23 21:24:20,495 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.46 vs. limit=15.0 2023-06-23 21:24:23,034 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=952698.0, ans=0.125 2023-06-23 21:24:35,845 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=952758.0, ans=0.0 2023-06-23 21:24:42,786 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=952758.0, ans=0.125 2023-06-23 21:24:51,460 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=952818.0, ans=0.0 2023-06-23 21:24:57,674 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.792e+02 2.558e+02 3.046e+02 3.782e+02 6.709e+02, threshold=6.092e+02, percent-clipped=4.0 2023-06-23 21:25:10,613 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=952818.0, ans=0.1 2023-06-23 21:25:34,547 INFO [train.py:996] (3/4) Epoch 6, batch 6350, loss[loss=0.2549, simple_loss=0.3327, pruned_loss=0.08861, over 21452.00 frames. ], tot_loss[loss=0.2243, simple_loss=0.3033, pruned_loss=0.07265, over 4283803.02 frames. ], batch size: 131, lr: 5.24e-03, grad_scale: 16.0 2023-06-23 21:26:12,807 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-23 21:26:14,176 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=952998.0, ans=0.0 2023-06-23 21:26:14,305 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=952998.0, ans=0.125 2023-06-23 21:26:35,238 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=953058.0, ans=0.125 2023-06-23 21:26:44,136 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=953058.0, ans=0.125 2023-06-23 21:26:45,618 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=953058.0, ans=0.1 2023-06-23 21:26:49,154 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=953118.0, ans=0.125 2023-06-23 21:27:29,898 INFO [train.py:996] (3/4) Epoch 6, batch 6400, loss[loss=0.2544, simple_loss=0.3306, pruned_loss=0.08916, over 21315.00 frames. ], tot_loss[loss=0.2324, simple_loss=0.3096, pruned_loss=0.0776, over 4284992.66 frames. 
], batch size: 143, lr: 5.24e-03, grad_scale: 32.0 2023-06-23 21:27:30,516 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=953238.0, ans=0.125 2023-06-23 21:28:42,593 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.145e+02 2.766e+02 2.997e+02 3.346e+02 4.721e+02, threshold=5.994e+02, percent-clipped=0.0 2023-06-23 21:28:51,977 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=953418.0, ans=0.1 2023-06-23 21:29:07,369 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=953478.0, ans=0.0 2023-06-23 21:29:24,311 INFO [train.py:996] (3/4) Epoch 6, batch 6450, loss[loss=0.2031, simple_loss=0.2822, pruned_loss=0.06197, over 21869.00 frames. ], tot_loss[loss=0.2329, simple_loss=0.3122, pruned_loss=0.07679, over 4287010.16 frames. ], batch size: 372, lr: 5.24e-03, grad_scale: 32.0 2023-06-23 21:31:13,673 INFO [train.py:996] (3/4) Epoch 6, batch 6500, loss[loss=0.1956, simple_loss=0.2571, pruned_loss=0.06705, over 21393.00 frames. ], tot_loss[loss=0.2271, simple_loss=0.3052, pruned_loss=0.0745, over 4284963.59 frames. ], batch size: 131, lr: 5.24e-03, grad_scale: 32.0 2023-06-23 21:31:34,700 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=953898.0, ans=0.125 2023-06-23 21:31:53,409 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=953898.0, ans=0.0 2023-06-23 21:32:18,651 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.973e+02 2.470e+02 2.695e+02 2.978e+02 5.209e+02, threshold=5.391e+02, percent-clipped=0.0 2023-06-23 21:32:24,750 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=954018.0, ans=0.1 2023-06-23 21:33:01,242 INFO [train.py:996] (3/4) Epoch 6, batch 6550, loss[loss=0.2373, simple_loss=0.309, pruned_loss=0.08281, over 21627.00 frames. ], tot_loss[loss=0.2265, simple_loss=0.3049, pruned_loss=0.07399, over 4284554.52 frames. ], batch size: 230, lr: 5.24e-03, grad_scale: 32.0 2023-06-23 21:33:14,451 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.69 vs. limit=6.0 2023-06-23 21:33:23,733 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=954198.0, ans=0.0 2023-06-23 21:34:06,276 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=954318.0, ans=0.0 2023-06-23 21:34:47,732 INFO [train.py:996] (3/4) Epoch 6, batch 6600, loss[loss=0.1861, simple_loss=0.241, pruned_loss=0.06557, over 21237.00 frames. ], tot_loss[loss=0.2225, simple_loss=0.2992, pruned_loss=0.07292, over 4267750.98 frames. 
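Annotation: each batch summary ends with a grad_scale field, which moves between values such as 16.0 and 32.0 in the records above; this is the dynamic loss scale used for mixed-precision (fp16) training. Below is a minimal sketch of the standard torch.cuda.amp pattern that produces such a scale; model, optimizer, loss_fn, and the input tensors are placeholders.

    import torch

    scaler = torch.cuda.amp.GradScaler()  # maintains the dynamic grad_scale

    def fp16_step(model, optimizer, features, targets, loss_fn):
        optimizer.zero_grad()
        with torch.cuda.amp.autocast():
            loss = loss_fn(model(features), targets)
        scaler.scale(loss).backward()   # backprop with the scaled loss
        scaler.step(optimizer)          # unscales grads; skips step on inf/nan
        scaler.update()                 # grows/shrinks the scale over time
        return loss.detach(), scaler.get_scale()  # get_scale() is the logged grad_scale
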
], batch size: 548, lr: 5.23e-03, grad_scale: 8.0 2023-06-23 21:35:04,080 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=954498.0, ans=0.125 2023-06-23 21:35:29,445 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=954498.0, ans=0.125 2023-06-23 21:36:01,782 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.786e+02 2.286e+02 2.575e+02 2.928e+02 5.219e+02, threshold=5.150e+02, percent-clipped=0.0 2023-06-23 21:36:28,973 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=954678.0, ans=0.0 2023-06-23 21:36:35,401 INFO [train.py:996] (3/4) Epoch 6, batch 6650, loss[loss=0.1889, simple_loss=0.2524, pruned_loss=0.06274, over 21782.00 frames. ], tot_loss[loss=0.2169, simple_loss=0.2916, pruned_loss=0.07107, over 4267140.56 frames. ], batch size: 118, lr: 5.23e-03, grad_scale: 8.0 2023-06-23 21:36:44,758 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=954738.0, ans=0.0 2023-06-23 21:37:02,679 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.60 vs. limit=15.0 2023-06-23 21:37:09,176 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=954798.0, ans=0.125 2023-06-23 21:37:43,524 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=954918.0, ans=0.2 2023-06-23 21:37:47,381 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.52 vs. limit=22.5 2023-06-23 21:38:18,829 INFO [train.py:996] (3/4) Epoch 6, batch 6700, loss[loss=0.1916, simple_loss=0.2641, pruned_loss=0.05953, over 21509.00 frames. ], tot_loss[loss=0.2134, simple_loss=0.2854, pruned_loss=0.0707, over 4273168.44 frames. ], batch size: 212, lr: 5.23e-03, grad_scale: 8.0 2023-06-23 21:39:34,439 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.914e+02 2.289e+02 2.607e+02 3.016e+02 4.316e+02, threshold=5.215e+02, percent-clipped=0.0 2023-06-23 21:39:35,737 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.70 vs. limit=12.0 2023-06-23 21:40:07,898 INFO [train.py:996] (3/4) Epoch 6, batch 6750, loss[loss=0.2447, simple_loss=0.3387, pruned_loss=0.07533, over 19817.00 frames. ], tot_loss[loss=0.2125, simple_loss=0.2828, pruned_loss=0.07106, over 4263083.08 frames. ], batch size: 703, lr: 5.23e-03, grad_scale: 8.0 2023-06-23 21:40:40,007 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=6.60 vs. limit=15.0 2023-06-23 21:41:10,279 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.85 vs. 
limit=12.0 2023-06-23 21:41:39,982 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=955578.0, ans=0.125 2023-06-23 21:41:48,882 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=955578.0, ans=0.125 2023-06-23 21:41:55,004 INFO [train.py:996] (3/4) Epoch 6, batch 6800, loss[loss=0.2051, simple_loss=0.2741, pruned_loss=0.06802, over 21854.00 frames. ], tot_loss[loss=0.2152, simple_loss=0.2855, pruned_loss=0.07241, over 4271377.04 frames. ], batch size: 107, lr: 5.23e-03, grad_scale: 16.0 2023-06-23 21:43:03,402 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.965e+02 2.510e+02 2.967e+02 3.494e+02 5.351e+02, threshold=5.935e+02, percent-clipped=1.0 2023-06-23 21:43:15,994 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.65 vs. limit=22.5 2023-06-23 21:43:42,665 INFO [train.py:996] (3/4) Epoch 6, batch 6850, loss[loss=0.243, simple_loss=0.3011, pruned_loss=0.09246, over 21802.00 frames. ], tot_loss[loss=0.2156, simple_loss=0.2852, pruned_loss=0.07306, over 4267454.36 frames. ], batch size: 414, lr: 5.23e-03, grad_scale: 16.0 2023-06-23 21:43:51,941 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=955938.0, ans=0.125 2023-06-23 21:44:03,858 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=955998.0, ans=0.0 2023-06-23 21:44:24,505 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=955998.0, ans=0.125 2023-06-23 21:44:33,848 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=956058.0, ans=0.125 2023-06-23 21:45:16,473 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=956178.0, ans=0.07 2023-06-23 21:45:32,166 INFO [train.py:996] (3/4) Epoch 6, batch 6900, loss[loss=0.2226, simple_loss=0.2948, pruned_loss=0.07522, over 21526.00 frames. ], tot_loss[loss=0.2174, simple_loss=0.2876, pruned_loss=0.07355, over 4278306.89 frames. ], batch size: 131, lr: 5.23e-03, grad_scale: 16.0 2023-06-23 21:46:17,388 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=956358.0, ans=0.125 2023-06-23 21:46:49,835 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.733e+02 2.526e+02 2.937e+02 3.629e+02 5.523e+02, threshold=5.874e+02, percent-clipped=0.0 2023-06-23 21:47:27,763 INFO [train.py:996] (3/4) Epoch 6, batch 6950, loss[loss=0.2856, simple_loss=0.348, pruned_loss=0.1116, over 21444.00 frames. ], tot_loss[loss=0.2157, simple_loss=0.2902, pruned_loss=0.07058, over 4276397.05 frames. 
], batch size: 471, lr: 5.23e-03, grad_scale: 16.0 2023-06-23 21:47:31,550 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=956538.0, ans=0.0 2023-06-23 21:48:14,733 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=956658.0, ans=0.125 2023-06-23 21:48:36,231 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-23 21:48:44,368 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=956718.0, ans=0.125 2023-06-23 21:49:14,664 INFO [train.py:996] (3/4) Epoch 6, batch 7000, loss[loss=0.2101, simple_loss=0.2746, pruned_loss=0.0728, over 21627.00 frames. ], tot_loss[loss=0.2195, simple_loss=0.2924, pruned_loss=0.07335, over 4277929.51 frames. ], batch size: 298, lr: 5.23e-03, grad_scale: 16.0 2023-06-23 21:49:26,810 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=8.83 vs. limit=15.0 2023-06-23 21:50:27,282 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.073e+02 2.602e+02 2.936e+02 3.362e+02 6.122e+02, threshold=5.872e+02, percent-clipped=1.0 2023-06-23 21:51:05,574 INFO [train.py:996] (3/4) Epoch 6, batch 7050, loss[loss=0.1773, simple_loss=0.2467, pruned_loss=0.05397, over 16375.00 frames. ], tot_loss[loss=0.2184, simple_loss=0.2911, pruned_loss=0.07287, over 4266322.61 frames. ], batch size: 61, lr: 5.23e-03, grad_scale: 16.0 2023-06-23 21:51:06,119 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=957138.0, ans=0.125 2023-06-23 21:51:06,787 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.38 vs. limit=15.0 2023-06-23 21:51:23,605 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=957198.0, ans=0.1 2023-06-23 21:51:24,266 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=15.14 vs. limit=22.5 2023-06-23 21:51:55,079 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=2.75 vs. limit=15.0 2023-06-23 21:52:49,879 INFO [train.py:996] (3/4) Epoch 6, batch 7100, loss[loss=0.2354, simple_loss=0.3108, pruned_loss=0.07998, over 20694.00 frames. ], tot_loss[loss=0.2232, simple_loss=0.2964, pruned_loss=0.07499, over 4270574.93 frames. 
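Annotation: the scaling.py:962 Whitening records compare a per-module "metric" against a limit (e.g. metric=8.83 vs. limit=15.0 above), reporting only when the metric exceeds or approaches the limit. A rough sketch of one way such a whiteness metric can be defined: the ratio of the mean squared entry of the channel covariance to the squared mean of its diagonal, scaled so it is about 1 for decorrelated, equal-variance channels and grows as channels become redundant. This illustrates the idea only and is not claimed to be the exact scaling.py formula.

    import torch

    def whitening_metric(x: torch.Tensor, num_groups: int = 1) -> float:
        """Illustrative whiteness metric for activations x of shape (N, C)."""
        n, c = x.shape
        cpg = c // num_groups
        metric = 0.0
        for g in range(num_groups):
            xg = x[:, g * cpg:(g + 1) * cpg]
            xg = xg - xg.mean(dim=0, keepdim=True)
            cov = (xg.t() @ xg) / n                      # (cpg, cpg) covariance
            mean_diag = cov.diagonal().mean()
            metric += float(cpg * (cov ** 2).mean() / (mean_diag ** 2 + 1e-20))
        return metric / num_groups

    x = torch.randn(10000, 256)          # nearly decorrelated input
    print(whitening_metric(x))           # close to 1; correlated channels push it up
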
], batch size: 607, lr: 5.23e-03, grad_scale: 16.0 2023-06-23 21:52:50,403 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=957438.0, ans=0.125 2023-06-23 21:53:00,761 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=957438.0, ans=0.035 2023-06-23 21:53:40,772 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=957558.0, ans=0.125 2023-06-23 21:54:06,827 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.538e+02 2.381e+02 2.673e+02 3.454e+02 5.437e+02, threshold=5.346e+02, percent-clipped=0.0 2023-06-23 21:54:14,660 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=957618.0, ans=0.0 2023-06-23 21:54:18,120 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=957678.0, ans=0.125 2023-06-23 21:54:22,024 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=957678.0, ans=0.2 2023-06-23 21:54:22,470 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.05 vs. limit=10.0 2023-06-23 21:54:35,283 INFO [train.py:996] (3/4) Epoch 6, batch 7150, loss[loss=0.2444, simple_loss=0.3146, pruned_loss=0.08711, over 21394.00 frames. ], tot_loss[loss=0.2198, simple_loss=0.2941, pruned_loss=0.07273, over 4274972.05 frames. ], batch size: 549, lr: 5.23e-03, grad_scale: 16.0 2023-06-23 21:55:27,985 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=957858.0, ans=0.0 2023-06-23 21:55:51,973 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=957918.0, ans=0.125 2023-06-23 21:56:24,641 INFO [train.py:996] (3/4) Epoch 6, batch 7200, loss[loss=0.2349, simple_loss=0.3138, pruned_loss=0.07797, over 21764.00 frames. ], tot_loss[loss=0.2243, simple_loss=0.2974, pruned_loss=0.07561, over 4277482.11 frames. ], batch size: 102, lr: 5.23e-03, grad_scale: 32.0 2023-06-23 21:57:09,416 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=958158.0, ans=0.125 2023-06-23 21:57:46,111 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.971e+02 2.518e+02 2.883e+02 3.559e+02 6.632e+02, threshold=5.766e+02, percent-clipped=3.0 2023-06-23 21:58:12,999 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.56 vs. limit=12.0 2023-06-23 21:58:13,503 INFO [train.py:996] (3/4) Epoch 6, batch 7250, loss[loss=0.2403, simple_loss=0.2779, pruned_loss=0.1014, over 21373.00 frames. ], tot_loss[loss=0.2217, simple_loss=0.2922, pruned_loss=0.07562, over 4277153.60 frames. ], batch size: 509, lr: 5.22e-03, grad_scale: 32.0 2023-06-23 22:00:01,983 INFO [train.py:996] (3/4) Epoch 6, batch 7300, loss[loss=0.1872, simple_loss=0.2504, pruned_loss=0.062, over 21807.00 frames. ], tot_loss[loss=0.2173, simple_loss=0.2855, pruned_loss=0.07454, over 4278339.58 frames. 
], batch size: 352, lr: 5.22e-03, grad_scale: 32.0 2023-06-23 22:00:33,341 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=958698.0, ans=0.0 2023-06-23 22:01:24,348 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.896e+02 2.461e+02 2.779e+02 3.106e+02 5.760e+02, threshold=5.558e+02, percent-clipped=0.0 2023-06-23 22:01:25,225 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=958818.0, ans=0.125 2023-06-23 22:01:51,241 INFO [train.py:996] (3/4) Epoch 6, batch 7350, loss[loss=0.2502, simple_loss=0.3167, pruned_loss=0.09179, over 21408.00 frames. ], tot_loss[loss=0.217, simple_loss=0.2844, pruned_loss=0.07475, over 4272218.39 frames. ], batch size: 549, lr: 5.22e-03, grad_scale: 16.0 2023-06-23 22:02:13,408 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.42 vs. limit=12.0 2023-06-23 22:03:37,924 INFO [train.py:996] (3/4) Epoch 6, batch 7400, loss[loss=0.2627, simple_loss=0.3478, pruned_loss=0.08881, over 21619.00 frames. ], tot_loss[loss=0.2216, simple_loss=0.2917, pruned_loss=0.07575, over 4273018.45 frames. ], batch size: 441, lr: 5.22e-03, grad_scale: 16.0 2023-06-23 22:03:59,715 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.22 vs. limit=6.0 2023-06-23 22:04:45,137 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.01 vs. limit=15.0 2023-06-23 22:04:50,045 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=11.57 vs. limit=22.5 2023-06-23 22:04:51,438 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-23 22:05:00,954 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.952e+02 2.692e+02 3.073e+02 3.719e+02 6.060e+02, threshold=6.147e+02, percent-clipped=2.0 2023-06-23 22:05:08,840 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=959478.0, ans=0.125 2023-06-23 22:05:39,060 INFO [train.py:996] (3/4) Epoch 6, batch 7450, loss[loss=0.1981, simple_loss=0.2628, pruned_loss=0.06668, over 21780.00 frames. ], tot_loss[loss=0.2198, simple_loss=0.2903, pruned_loss=0.07464, over 4264755.76 frames. 
], batch size: 124, lr: 5.22e-03, grad_scale: 16.0 2023-06-23 22:05:39,607 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=959538.0, ans=0.125 2023-06-23 22:05:44,900 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=959538.0, ans=0.125 2023-06-23 22:06:13,379 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=959598.0, ans=0.0 2023-06-23 22:06:25,603 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=959658.0, ans=0.07 2023-06-23 22:06:43,039 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=959718.0, ans=0.125 2023-06-23 22:07:05,056 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=959778.0, ans=0.125 2023-06-23 22:07:30,945 INFO [train.py:996] (3/4) Epoch 6, batch 7500, loss[loss=0.2427, simple_loss=0.3495, pruned_loss=0.068, over 21555.00 frames. ], tot_loss[loss=0.2237, simple_loss=0.2945, pruned_loss=0.07641, over 4269613.82 frames. ], batch size: 263, lr: 5.22e-03, grad_scale: 16.0 2023-06-23 22:08:02,905 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-23 22:08:32,462 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=959958.0, ans=0.125 2023-06-23 22:08:35,882 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=960018.0, ans=0.125 2023-06-23 22:08:44,086 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.946e+02 2.824e+02 3.431e+02 4.118e+02 7.261e+02, threshold=6.863e+02, percent-clipped=3.0 2023-06-23 22:09:09,295 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=960078.0, ans=0.0 2023-06-23 22:09:20,920 INFO [train.py:996] (3/4) Epoch 6, batch 7550, loss[loss=0.2067, simple_loss=0.3227, pruned_loss=0.04532, over 20784.00 frames. ], tot_loss[loss=0.2266, simple_loss=0.3025, pruned_loss=0.07539, over 4268809.64 frames. ], batch size: 608, lr: 5.22e-03, grad_scale: 16.0 2023-06-23 22:09:55,307 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-23 22:10:02,493 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.27 vs. limit=12.0 2023-06-23 22:10:22,851 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=960318.0, ans=0.0 2023-06-23 22:11:08,225 INFO [train.py:996] (3/4) Epoch 6, batch 7600, loss[loss=0.203, simple_loss=0.2765, pruned_loss=0.06474, over 21895.00 frames. ], tot_loss[loss=0.2255, simple_loss=0.3013, pruned_loss=0.0748, over 4276816.30 frames. 
], batch size: 351, lr: 5.22e-03, grad_scale: 32.0 2023-06-23 22:11:22,624 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=960438.0, ans=0.125 2023-06-23 22:12:05,659 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=960558.0, ans=0.125 2023-06-23 22:12:18,808 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.780e+02 2.489e+02 2.806e+02 3.400e+02 5.423e+02, threshold=5.611e+02, percent-clipped=0.0 2023-06-23 22:12:35,455 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=960678.0, ans=0.125 2023-06-23 22:12:53,303 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=960678.0, ans=0.2 2023-06-23 22:12:56,162 INFO [train.py:996] (3/4) Epoch 6, batch 7650, loss[loss=0.2199, simple_loss=0.2846, pruned_loss=0.07762, over 21597.00 frames. ], tot_loss[loss=0.2273, simple_loss=0.3012, pruned_loss=0.07675, over 4277749.32 frames. ], batch size: 212, lr: 5.22e-03, grad_scale: 32.0 2023-06-23 22:13:14,091 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=960738.0, ans=0.125 2023-06-23 22:13:22,309 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=960798.0, ans=0.1 2023-06-23 22:13:50,361 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=960858.0, ans=0.1 2023-06-23 22:13:59,998 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.75 vs. limit=15.0 2023-06-23 22:14:30,898 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=8.82 vs. limit=15.0 2023-06-23 22:14:39,400 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=960978.0, ans=0.0 2023-06-23 22:14:51,520 INFO [train.py:996] (3/4) Epoch 6, batch 7700, loss[loss=0.1904, simple_loss=0.2914, pruned_loss=0.04469, over 20703.00 frames. ], tot_loss[loss=0.2312, simple_loss=0.3044, pruned_loss=0.07899, over 4280172.89 frames. ], batch size: 608, lr: 5.22e-03, grad_scale: 32.0 2023-06-23 22:14:59,224 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=961038.0, ans=0.125 2023-06-23 22:16:06,415 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.909e+02 2.621e+02 3.080e+02 3.761e+02 5.045e+02, threshold=6.161e+02, percent-clipped=0.0 2023-06-23 22:16:43,591 INFO [train.py:996] (3/4) Epoch 6, batch 7750, loss[loss=0.2182, simple_loss=0.3034, pruned_loss=0.06651, over 21367.00 frames. ], tot_loss[loss=0.2323, simple_loss=0.3081, pruned_loss=0.07819, over 4272525.08 frames. ], batch size: 131, lr: 5.22e-03, grad_scale: 32.0 2023-06-23 22:17:07,339 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=961398.0, ans=0.125 2023-06-23 22:17:26,348 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=5.77 vs. 
limit=15.0 2023-06-23 22:17:49,485 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=961518.0, ans=0.125 2023-06-23 22:18:06,924 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=961518.0, ans=0.125 2023-06-23 22:18:30,107 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=961578.0, ans=0.125 2023-06-23 22:18:34,469 INFO [train.py:996] (3/4) Epoch 6, batch 7800, loss[loss=0.1868, simple_loss=0.2374, pruned_loss=0.06807, over 21360.00 frames. ], tot_loss[loss=0.2337, simple_loss=0.3097, pruned_loss=0.07883, over 4274239.17 frames. ], batch size: 131, lr: 5.22e-03, grad_scale: 32.0 2023-06-23 22:18:38,249 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=961638.0, ans=0.125 2023-06-23 22:18:41,466 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=961638.0, ans=0.0 2023-06-23 22:18:59,646 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=7.64 vs. limit=15.0 2023-06-23 22:19:45,757 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.173e+02 2.845e+02 3.471e+02 4.135e+02 7.731e+02, threshold=6.941e+02, percent-clipped=4.0 2023-06-23 22:20:21,460 INFO [train.py:996] (3/4) Epoch 6, batch 7850, loss[loss=0.2134, simple_loss=0.2804, pruned_loss=0.0732, over 21795.00 frames. ], tot_loss[loss=0.2283, simple_loss=0.3018, pruned_loss=0.07746, over 4280197.29 frames. ], batch size: 372, lr: 5.21e-03, grad_scale: 32.0 2023-06-23 22:20:55,667 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=962058.0, ans=0.125 2023-06-23 22:21:02,971 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=962058.0, ans=0.2 2023-06-23 22:22:11,770 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.30 vs. limit=12.0 2023-06-23 22:22:15,672 INFO [train.py:996] (3/4) Epoch 6, batch 7900, loss[loss=0.2787, simple_loss=0.3747, pruned_loss=0.09138, over 21639.00 frames. ], tot_loss[loss=0.2253, simple_loss=0.2967, pruned_loss=0.07696, over 4276965.41 frames. ], batch size: 441, lr: 5.21e-03, grad_scale: 32.0 2023-06-23 22:22:35,115 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=962298.0, ans=0.1 2023-06-23 22:22:35,591 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.31 vs. limit=22.5 2023-06-23 22:22:38,824 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=962298.0, ans=0.2 2023-06-23 22:22:42,944 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=2.45 vs. limit=15.0 2023-06-23 22:22:55,253 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.00 vs. 
limit=6.0 2023-06-23 22:23:10,639 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=962358.0, ans=0.125 2023-06-23 22:23:36,788 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.156e+02 2.814e+02 3.173e+02 3.712e+02 6.452e+02, threshold=6.346e+02, percent-clipped=0.0 2023-06-23 22:23:55,380 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=962478.0, ans=0.0 2023-06-23 22:24:03,321 INFO [train.py:996] (3/4) Epoch 6, batch 7950, loss[loss=0.2446, simple_loss=0.3234, pruned_loss=0.08285, over 21874.00 frames. ], tot_loss[loss=0.2292, simple_loss=0.3043, pruned_loss=0.07703, over 4280736.82 frames. ], batch size: 371, lr: 5.21e-03, grad_scale: 32.0 2023-06-23 22:24:43,378 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.03 vs. limit=22.5 2023-06-23 22:25:14,713 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=6.79 vs. limit=15.0 2023-06-23 22:25:32,724 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=962778.0, ans=0.125 2023-06-23 22:25:58,631 INFO [train.py:996] (3/4) Epoch 6, batch 8000, loss[loss=0.2641, simple_loss=0.3436, pruned_loss=0.09232, over 21182.00 frames. ], tot_loss[loss=0.2333, simple_loss=0.3079, pruned_loss=0.07934, over 4275008.53 frames. ], batch size: 143, lr: 5.21e-03, grad_scale: 32.0 2023-06-23 22:26:01,170 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=962838.0, ans=0.125 2023-06-23 22:27:07,138 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=962958.0, ans=0.125 2023-06-23 22:27:20,912 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.094e+02 2.660e+02 3.200e+02 3.986e+02 6.358e+02, threshold=6.400e+02, percent-clipped=1.0 2023-06-23 22:27:21,963 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.47 vs. limit=15.0 2023-06-23 22:27:23,153 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=963018.0, ans=0.1 2023-06-23 22:27:51,564 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=963078.0, ans=0.0 2023-06-23 22:27:59,887 INFO [train.py:996] (3/4) Epoch 6, batch 8050, loss[loss=0.2191, simple_loss=0.2796, pruned_loss=0.07927, over 21244.00 frames. ], tot_loss[loss=0.2343, simple_loss=0.3105, pruned_loss=0.07908, over 4271993.83 frames. ], batch size: 607, lr: 5.21e-03, grad_scale: 32.0 2023-06-23 22:28:07,664 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=963138.0, ans=0.0 2023-06-23 22:29:14,067 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.32 vs. limit=15.0 2023-06-23 22:29:51,719 INFO [train.py:996] (3/4) Epoch 6, batch 8100, loss[loss=0.1997, simple_loss=0.2696, pruned_loss=0.06493, over 21659.00 frames. 
], tot_loss[loss=0.2343, simple_loss=0.3092, pruned_loss=0.07977, over 4278435.18 frames. ], batch size: 263, lr: 5.21e-03, grad_scale: 32.0 2023-06-23 22:30:20,100 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=963498.0, ans=0.1 2023-06-23 22:31:11,482 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.60 vs. limit=15.0 2023-06-23 22:31:23,067 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.199e+02 2.897e+02 3.319e+02 4.086e+02 8.225e+02, threshold=6.637e+02, percent-clipped=1.0 2023-06-23 22:31:58,986 INFO [train.py:996] (3/4) Epoch 6, batch 8150, loss[loss=0.3279, simple_loss=0.4125, pruned_loss=0.1217, over 21484.00 frames. ], tot_loss[loss=0.2393, simple_loss=0.3166, pruned_loss=0.08097, over 4278988.80 frames. ], batch size: 507, lr: 5.21e-03, grad_scale: 16.0 2023-06-23 22:32:27,467 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=963798.0, ans=10.0 2023-06-23 22:33:07,272 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=7.12 vs. limit=15.0 2023-06-23 22:33:39,929 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=963978.0, ans=0.2 2023-06-23 22:33:48,077 INFO [train.py:996] (3/4) Epoch 6, batch 8200, loss[loss=0.1852, simple_loss=0.2512, pruned_loss=0.05964, over 21347.00 frames. ], tot_loss[loss=0.2334, simple_loss=0.3096, pruned_loss=0.07862, over 4269131.83 frames. ], batch size: 131, lr: 5.21e-03, grad_scale: 16.0 2023-06-23 22:34:42,133 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=964158.0, ans=0.0 2023-06-23 22:34:47,001 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=964158.0, ans=0.125 2023-06-23 22:34:52,498 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=964218.0, ans=0.125 2023-06-23 22:35:09,637 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.036e+02 2.474e+02 2.845e+02 3.510e+02 6.334e+02, threshold=5.689e+02, percent-clipped=0.0 2023-06-23 22:35:39,823 INFO [train.py:996] (3/4) Epoch 6, batch 8250, loss[loss=0.2426, simple_loss=0.3457, pruned_loss=0.06975, over 20770.00 frames. ], tot_loss[loss=0.2312, simple_loss=0.3068, pruned_loss=0.07784, over 4270356.77 frames. ], batch size: 607, lr: 5.21e-03, grad_scale: 16.0 2023-06-23 22:36:59,926 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=964518.0, ans=0.0 2023-06-23 22:37:04,841 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-23 22:37:30,675 INFO [train.py:996] (3/4) Epoch 6, batch 8300, loss[loss=0.1888, simple_loss=0.2704, pruned_loss=0.0536, over 21379.00 frames. ], tot_loss[loss=0.2285, simple_loss=0.3069, pruned_loss=0.07509, over 4270446.36 frames. 
], batch size: 194, lr: 5.21e-03, grad_scale: 16.0 2023-06-23 22:37:39,995 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=964638.0, ans=0.2 2023-06-23 22:37:45,147 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=964638.0, ans=0.1 2023-06-23 22:38:48,058 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=964818.0, ans=0.2 2023-06-23 22:38:49,164 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.813e+02 2.358e+02 2.866e+02 3.291e+02 6.256e+02, threshold=5.732e+02, percent-clipped=2.0 2023-06-23 22:39:19,227 INFO [train.py:996] (3/4) Epoch 6, batch 8350, loss[loss=0.2007, simple_loss=0.2916, pruned_loss=0.05486, over 21561.00 frames. ], tot_loss[loss=0.2255, simple_loss=0.3055, pruned_loss=0.0728, over 4274214.43 frames. ], batch size: 195, lr: 5.21e-03, grad_scale: 16.0 2023-06-23 22:40:01,379 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-23 22:40:22,566 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=965058.0, ans=0.125 2023-06-23 22:41:08,741 INFO [train.py:996] (3/4) Epoch 6, batch 8400, loss[loss=0.1639, simple_loss=0.2512, pruned_loss=0.0383, over 21162.00 frames. ], tot_loss[loss=0.2238, simple_loss=0.3044, pruned_loss=0.07163, over 4269954.65 frames. ], batch size: 176, lr: 5.21e-03, grad_scale: 32.0 2023-06-23 22:41:12,555 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=965238.0, ans=0.125 2023-06-23 22:41:21,844 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.79 vs. limit=15.0 2023-06-23 22:42:14,338 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=965418.0, ans=0.125 2023-06-23 22:42:27,772 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.585e+02 2.294e+02 2.573e+02 3.024e+02 4.553e+02, threshold=5.145e+02, percent-clipped=0.0 2023-06-23 22:42:44,208 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=965478.0, ans=0.0 2023-06-23 22:42:46,037 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=965478.0, ans=0.1 2023-06-23 22:42:55,766 INFO [train.py:996] (3/4) Epoch 6, batch 8450, loss[loss=0.2386, simple_loss=0.2985, pruned_loss=0.08937, over 21650.00 frames. ], tot_loss[loss=0.2213, simple_loss=0.3016, pruned_loss=0.07048, over 4277481.93 frames. ], batch size: 389, lr: 5.20e-03, grad_scale: 16.0 2023-06-23 22:43:56,358 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.39 vs. limit=6.0 2023-06-23 22:44:44,906 INFO [train.py:996] (3/4) Epoch 6, batch 8500, loss[loss=0.1938, simple_loss=0.2598, pruned_loss=0.06396, over 21429.00 frames. ], tot_loss[loss=0.2215, simple_loss=0.2989, pruned_loss=0.07199, over 4283476.81 frames. 
], batch size: 194, lr: 5.20e-03, grad_scale: 16.0 2023-06-23 22:45:38,144 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=965958.0, ans=0.0 2023-06-23 22:45:38,311 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=965958.0, ans=0.125 2023-06-23 22:46:13,503 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.818e+02 2.833e+02 3.387e+02 4.039e+02 6.147e+02, threshold=6.774e+02, percent-clipped=7.0 2023-06-23 22:46:36,910 INFO [train.py:996] (3/4) Epoch 6, batch 8550, loss[loss=0.2378, simple_loss=0.2954, pruned_loss=0.09013, over 20053.00 frames. ], tot_loss[loss=0.2257, simple_loss=0.3017, pruned_loss=0.07481, over 4284022.81 frames. ], batch size: 702, lr: 5.20e-03, grad_scale: 16.0 2023-06-23 22:47:31,558 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=966258.0, ans=0.125 2023-06-23 22:48:34,745 INFO [train.py:996] (3/4) Epoch 6, batch 8600, loss[loss=0.2864, simple_loss=0.3544, pruned_loss=0.1092, over 21431.00 frames. ], tot_loss[loss=0.2305, simple_loss=0.307, pruned_loss=0.07703, over 4280375.99 frames. ], batch size: 471, lr: 5.20e-03, grad_scale: 16.0 2023-06-23 22:48:50,634 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=966438.0, ans=0.05 2023-06-23 22:49:18,369 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.78 vs. limit=6.0 2023-06-23 22:49:24,832 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=966558.0, ans=0.2 2023-06-23 22:49:47,751 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.44 vs. limit=15.0 2023-06-23 22:49:52,704 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=966618.0, ans=0.125 2023-06-23 22:49:57,419 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.156e+02 2.875e+02 3.260e+02 4.247e+02 6.190e+02, threshold=6.520e+02, percent-clipped=0.0 2023-06-23 22:50:01,406 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=966678.0, ans=0.0 2023-06-23 22:50:30,004 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=966738.0, ans=0.0 2023-06-23 22:50:31,074 INFO [train.py:996] (3/4) Epoch 6, batch 8650, loss[loss=0.2633, simple_loss=0.3517, pruned_loss=0.0874, over 21477.00 frames. ], tot_loss[loss=0.235, simple_loss=0.3132, pruned_loss=0.07843, over 4278664.83 frames. 
], batch size: 507, lr: 5.20e-03, grad_scale: 16.0 2023-06-23 22:50:33,632 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten.whitening_limit, batch_count=966738.0, ans=22.5 2023-06-23 22:50:57,660 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=966798.0, ans=0.0 2023-06-23 22:51:07,368 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=966798.0, ans=0.0 2023-06-23 22:51:09,395 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=966858.0, ans=0.2 2023-06-23 22:51:47,212 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=966918.0, ans=0.0 2023-06-23 22:51:50,804 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=966978.0, ans=0.0 2023-06-23 22:51:54,208 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=966978.0, ans=0.2 2023-06-23 22:52:13,203 INFO [train.py:996] (3/4) Epoch 6, batch 8700, loss[loss=0.2203, simple_loss=0.2744, pruned_loss=0.08311, over 21231.00 frames. ], tot_loss[loss=0.2273, simple_loss=0.3043, pruned_loss=0.07516, over 4280525.52 frames. ], batch size: 471, lr: 5.20e-03, grad_scale: 16.0 2023-06-23 22:52:35,003 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=967098.0, ans=0.1 2023-06-23 22:52:58,737 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=967158.0, ans=0.125 2023-06-23 22:53:27,060 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=967218.0, ans=0.125 2023-06-23 22:53:33,441 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.609e+02 2.283e+02 2.590e+02 3.172e+02 4.476e+02, threshold=5.179e+02, percent-clipped=0.0 2023-06-23 22:54:08,943 INFO [train.py:996] (3/4) Epoch 6, batch 8750, loss[loss=0.2233, simple_loss=0.2844, pruned_loss=0.08114, over 21310.00 frames. ], tot_loss[loss=0.2253, simple_loss=0.2999, pruned_loss=0.07531, over 4278154.91 frames. ], batch size: 176, lr: 5.20e-03, grad_scale: 16.0 2023-06-23 22:54:15,965 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=12.71 vs. limit=22.5 2023-06-23 22:54:37,715 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=967398.0, ans=0.125 2023-06-23 22:54:46,841 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=967398.0, ans=0.2 2023-06-23 22:55:19,337 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=967518.0, ans=0.1 2023-06-23 22:56:02,191 INFO [train.py:996] (3/4) Epoch 6, batch 8800, loss[loss=0.2837, simple_loss=0.3648, pruned_loss=0.1013, over 21730.00 frames. ], tot_loss[loss=0.2319, simple_loss=0.3076, pruned_loss=0.07807, over 4278689.12 frames. 
], batch size: 441, lr: 5.20e-03, grad_scale: 32.0 2023-06-23 22:56:25,730 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=967698.0, ans=0.125 2023-06-23 22:57:04,147 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=967818.0, ans=0.125 2023-06-23 22:57:18,379 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=967818.0, ans=0.2 2023-06-23 22:57:28,535 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.027e+02 2.723e+02 3.088e+02 3.591e+02 5.183e+02, threshold=6.177e+02, percent-clipped=1.0 2023-06-23 22:57:56,315 INFO [train.py:996] (3/4) Epoch 6, batch 8850, loss[loss=0.2168, simple_loss=0.2961, pruned_loss=0.06873, over 21555.00 frames. ], tot_loss[loss=0.2369, simple_loss=0.3146, pruned_loss=0.07961, over 4271753.76 frames. ], batch size: 230, lr: 5.20e-03, grad_scale: 32.0 2023-06-23 22:58:40,998 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=968058.0, ans=0.125 2023-06-23 22:59:46,016 INFO [train.py:996] (3/4) Epoch 6, batch 8900, loss[loss=0.2197, simple_loss=0.2784, pruned_loss=0.0805, over 21521.00 frames. ], tot_loss[loss=0.2331, simple_loss=0.3102, pruned_loss=0.07806, over 4268260.73 frames. ], batch size: 441, lr: 5.20e-03, grad_scale: 32.0 2023-06-23 22:59:53,833 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=968238.0, ans=0.125 2023-06-23 23:00:20,373 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=12.95 vs. limit=15.0 2023-06-23 23:00:28,037 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.05 vs. limit=15.0 2023-06-23 23:00:37,231 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=968358.0, ans=0.125 2023-06-23 23:01:18,316 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.874e+02 2.656e+02 3.141e+02 3.730e+02 7.900e+02, threshold=6.282e+02, percent-clipped=6.0 2023-06-23 23:01:31,220 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=968478.0, ans=0.125 2023-06-23 23:01:39,330 INFO [train.py:996] (3/4) Epoch 6, batch 8950, loss[loss=0.1996, simple_loss=0.2894, pruned_loss=0.05486, over 21605.00 frames. ], tot_loss[loss=0.231, simple_loss=0.3075, pruned_loss=0.07725, over 4269270.14 frames. ], batch size: 263, lr: 5.20e-03, grad_scale: 16.0 2023-06-23 23:01:39,807 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=968538.0, ans=0.125 2023-06-23 23:02:23,810 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-23 23:03:29,115 INFO [train.py:996] (3/4) Epoch 6, batch 9000, loss[loss=0.2029, simple_loss=0.2631, pruned_loss=0.07133, over 21168.00 frames. ], tot_loss[loss=0.228, simple_loss=0.3021, pruned_loss=0.07697, over 4276769.67 frames. 
], batch size: 176, lr: 5.20e-03, grad_scale: 16.0 2023-06-23 23:03:29,116 INFO [train.py:1019] (3/4) Computing validation loss 2023-06-23 23:03:48,707 INFO [train.py:1028] (3/4) Epoch 6, validation: loss=0.2652, simple_loss=0.3551, pruned_loss=0.08764, over 1796401.00 frames. 2023-06-23 23:03:48,708 INFO [train.py:1029] (3/4) Maximum memory allocated so far is 23336MB 2023-06-23 23:03:56,153 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=968838.0, ans=0.125 2023-06-23 23:03:56,266 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=968838.0, ans=0.125 2023-06-23 23:04:16,410 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=968898.0, ans=0.125 2023-06-23 23:04:39,660 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=968958.0, ans=0.125 2023-06-23 23:05:12,831 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.595e+02 2.551e+02 3.018e+02 3.495e+02 6.048e+02, threshold=6.037e+02, percent-clipped=0.0 2023-06-23 23:05:45,388 INFO [train.py:996] (3/4) Epoch 6, batch 9050, loss[loss=0.215, simple_loss=0.2952, pruned_loss=0.06744, over 21685.00 frames. ], tot_loss[loss=0.2228, simple_loss=0.2987, pruned_loss=0.0735, over 4275960.02 frames. ], batch size: 298, lr: 5.20e-03, grad_scale: 16.0 2023-06-23 23:06:23,745 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=969198.0, ans=0.125 2023-06-23 23:07:30,957 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=969378.0, ans=0.125 2023-06-23 23:07:39,249 INFO [train.py:996] (3/4) Epoch 6, batch 9100, loss[loss=0.2264, simple_loss=0.3253, pruned_loss=0.06379, over 21732.00 frames. ], tot_loss[loss=0.2291, simple_loss=0.3057, pruned_loss=0.07627, over 4279861.16 frames. ], batch size: 351, lr: 5.19e-03, grad_scale: 16.0 2023-06-23 23:08:40,099 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.45 vs. limit=6.0 2023-06-23 23:09:04,184 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.712e+02 2.470e+02 2.760e+02 3.335e+02 5.659e+02, threshold=5.519e+02, percent-clipped=0.0 2023-06-23 23:09:05,224 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.54 vs. limit=15.0 2023-06-23 23:09:27,820 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=969678.0, ans=0.125 2023-06-23 23:09:30,899 INFO [train.py:996] (3/4) Epoch 6, batch 9150, loss[loss=0.2301, simple_loss=0.3068, pruned_loss=0.07669, over 21433.00 frames. ], tot_loss[loss=0.2299, simple_loss=0.31, pruned_loss=0.07486, over 4272612.46 frames. 
], batch size: 160, lr: 5.19e-03, grad_scale: 16.0 2023-06-23 23:10:30,751 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=969858.0, ans=0.0 2023-06-23 23:10:48,570 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=969918.0, ans=0.125 2023-06-23 23:11:22,066 INFO [train.py:996] (3/4) Epoch 6, batch 9200, loss[loss=0.3191, simple_loss=0.3785, pruned_loss=0.1299, over 21375.00 frames. ], tot_loss[loss=0.2289, simple_loss=0.3104, pruned_loss=0.0737, over 4276201.48 frames. ], batch size: 507, lr: 5.19e-03, grad_scale: 32.0 2023-06-23 23:11:54,545 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=970098.0, ans=0.1 2023-06-23 23:11:58,233 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=970098.0, ans=0.125 2023-06-23 23:12:20,484 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.13 vs. limit=22.5 2023-06-23 23:12:24,861 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=970158.0, ans=0.125 2023-06-23 23:12:51,023 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.987e+02 2.565e+02 2.927e+02 3.982e+02 7.343e+02, threshold=5.853e+02, percent-clipped=8.0 2023-06-23 23:13:17,985 INFO [train.py:996] (3/4) Epoch 6, batch 9250, loss[loss=0.2227, simple_loss=0.2881, pruned_loss=0.07862, over 21433.00 frames. ], tot_loss[loss=0.2339, simple_loss=0.3156, pruned_loss=0.07615, over 4267878.65 frames. ], batch size: 389, lr: 5.19e-03, grad_scale: 32.0 2023-06-23 23:13:53,617 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=970398.0, ans=0.5 2023-06-23 23:13:53,666 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=970398.0, ans=0.05 2023-06-23 23:15:13,906 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=970638.0, ans=0.125 2023-06-23 23:15:15,151 INFO [train.py:996] (3/4) Epoch 6, batch 9300, loss[loss=0.228, simple_loss=0.3031, pruned_loss=0.07642, over 21275.00 frames. ], tot_loss[loss=0.2308, simple_loss=0.3101, pruned_loss=0.07575, over 4265789.67 frames. ], batch size: 159, lr: 5.19e-03, grad_scale: 32.0 2023-06-23 23:15:20,183 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=4.47 vs. limit=15.0 2023-06-23 23:15:26,769 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=970638.0, ans=10.0 2023-06-23 23:16:33,204 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.072e+02 2.705e+02 3.300e+02 3.579e+02 5.908e+02, threshold=6.601e+02, percent-clipped=1.0 2023-06-23 23:16:37,437 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=970878.0, ans=0.0 2023-06-23 23:17:06,338 INFO [train.py:996] (3/4) Epoch 6, batch 9350, loss[loss=0.2517, simple_loss=0.3338, pruned_loss=0.08483, over 21879.00 frames. 
], tot_loss[loss=0.2338, simple_loss=0.3137, pruned_loss=0.07701, over 4267225.49 frames. ], batch size: 371, lr: 5.19e-03, grad_scale: 32.0 2023-06-23 23:18:28,342 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=8.78 vs. limit=15.0 2023-06-23 23:18:57,430 INFO [train.py:996] (3/4) Epoch 6, batch 9400, loss[loss=0.2165, simple_loss=0.2787, pruned_loss=0.07716, over 21247.00 frames. ], tot_loss[loss=0.236, simple_loss=0.3161, pruned_loss=0.07801, over 4262079.20 frames. ], batch size: 159, lr: 5.19e-03, grad_scale: 32.0 2023-06-23 23:18:57,976 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=971238.0, ans=0.2 2023-06-23 23:20:25,045 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.059e+02 2.477e+02 2.813e+02 3.524e+02 8.030e+02, threshold=5.626e+02, percent-clipped=3.0 2023-06-23 23:20:46,096 INFO [train.py:996] (3/4) Epoch 6, batch 9450, loss[loss=0.2123, simple_loss=0.2706, pruned_loss=0.07702, over 21213.00 frames. ], tot_loss[loss=0.2307, simple_loss=0.3077, pruned_loss=0.07685, over 4258321.08 frames. ], batch size: 159, lr: 5.19e-03, grad_scale: 16.0 2023-06-23 23:21:06,936 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=971598.0, ans=0.0 2023-06-23 23:21:23,210 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=971598.0, ans=0.125 2023-06-23 23:21:42,502 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=971658.0, ans=0.5 2023-06-23 23:21:48,441 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=6.50 vs. limit=15.0 2023-06-23 23:21:49,415 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=971718.0, ans=0.2 2023-06-23 23:22:12,564 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=971778.0, ans=0.025 2023-06-23 23:22:26,486 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=971778.0, ans=0.0 2023-06-23 23:22:29,377 INFO [train.py:996] (3/4) Epoch 6, batch 9500, loss[loss=0.1781, simple_loss=0.2552, pruned_loss=0.05044, over 21329.00 frames. ], tot_loss[loss=0.2253, simple_loss=0.3001, pruned_loss=0.07525, over 4248049.88 frames. ], batch size: 176, lr: 5.19e-03, grad_scale: 16.0 2023-06-23 23:23:55,315 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.823e+02 2.481e+02 2.768e+02 3.385e+02 5.932e+02, threshold=5.537e+02, percent-clipped=1.0 2023-06-23 23:24:05,398 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=6.90 vs. limit=15.0 2023-06-23 23:24:20,118 INFO [train.py:996] (3/4) Epoch 6, batch 9550, loss[loss=0.2781, simple_loss=0.3497, pruned_loss=0.1033, over 21606.00 frames. ], tot_loss[loss=0.2301, simple_loss=0.3049, pruned_loss=0.07764, over 4254442.43 frames. 
], batch size: 389, lr: 5.19e-03, grad_scale: 16.0 2023-06-23 23:25:03,341 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=972258.0, ans=0.0 2023-06-23 23:25:06,765 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=972258.0, ans=0.1 2023-06-23 23:25:06,790 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=972258.0, ans=0.1 2023-06-23 23:25:59,230 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=972378.0, ans=0.125 2023-06-23 23:26:02,057 INFO [train.py:996] (3/4) Epoch 6, batch 9600, loss[loss=0.2112, simple_loss=0.2804, pruned_loss=0.07103, over 21882.00 frames. ], tot_loss[loss=0.2339, simple_loss=0.3081, pruned_loss=0.0799, over 4267046.72 frames. ], batch size: 298, lr: 5.19e-03, grad_scale: 32.0 2023-06-23 23:26:14,646 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=972438.0, ans=0.1 2023-06-23 23:26:30,149 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=972498.0, ans=0.125 2023-06-23 23:26:36,174 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.14 vs. limit=15.0 2023-06-23 23:26:56,359 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=972558.0, ans=0.125 2023-06-23 23:27:29,642 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=972618.0, ans=0.2 2023-06-23 23:27:32,661 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.956e+02 2.542e+02 2.834e+02 3.285e+02 4.885e+02, threshold=5.668e+02, percent-clipped=0.0 2023-06-23 23:27:33,652 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=972678.0, ans=0.125 2023-06-23 23:27:43,356 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=972678.0, ans=0.035 2023-06-23 23:28:01,869 INFO [train.py:996] (3/4) Epoch 6, batch 9650, loss[loss=0.2508, simple_loss=0.3252, pruned_loss=0.0882, over 21715.00 frames. ], tot_loss[loss=0.234, simple_loss=0.3086, pruned_loss=0.07968, over 4272183.00 frames. ], batch size: 351, lr: 5.19e-03, grad_scale: 16.0 2023-06-23 23:28:48,348 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten.whitening_limit, batch_count=972858.0, ans=15.0 2023-06-23 23:29:50,944 INFO [train.py:996] (3/4) Epoch 6, batch 9700, loss[loss=0.2001, simple_loss=0.2833, pruned_loss=0.05846, over 21782.00 frames. ], tot_loss[loss=0.2348, simple_loss=0.3109, pruned_loss=0.07935, over 4279492.08 frames. 
], batch size: 298, lr: 5.18e-03, grad_scale: 16.0 2023-06-23 23:30:32,451 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=973158.0, ans=0.2 2023-06-23 23:30:57,309 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=973218.0, ans=0.2 2023-06-23 23:30:59,276 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=973218.0, ans=0.125 2023-06-23 23:31:10,721 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.037e+02 2.422e+02 2.744e+02 3.326e+02 5.586e+02, threshold=5.488e+02, percent-clipped=0.0 2023-06-23 23:31:16,501 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=973278.0, ans=0.125 2023-06-23 23:31:32,031 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=973338.0, ans=0.04949747468305833 2023-06-23 23:31:38,368 INFO [train.py:996] (3/4) Epoch 6, batch 9750, loss[loss=0.2653, simple_loss=0.3475, pruned_loss=0.0915, over 21869.00 frames. ], tot_loss[loss=0.2305, simple_loss=0.3047, pruned_loss=0.07816, over 4273491.48 frames. ], batch size: 107, lr: 5.18e-03, grad_scale: 16.0 2023-06-23 23:31:46,415 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=14.21 vs. limit=22.5 2023-06-23 23:32:41,297 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=973518.0, ans=0.125 2023-06-23 23:33:19,468 INFO [train.py:996] (3/4) Epoch 6, batch 9800, loss[loss=0.2146, simple_loss=0.285, pruned_loss=0.07216, over 21655.00 frames. ], tot_loss[loss=0.2298, simple_loss=0.3041, pruned_loss=0.07777, over 4271842.44 frames. ], batch size: 230, lr: 5.18e-03, grad_scale: 16.0 2023-06-23 23:33:53,568 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=973698.0, ans=0.2 2023-06-23 23:34:14,712 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=973758.0, ans=0.125 2023-06-23 23:34:16,035 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=973758.0, ans=0.125 2023-06-23 23:34:42,464 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=973818.0, ans=0.0 2023-06-23 23:34:45,284 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.997e+02 2.591e+02 2.983e+02 3.638e+02 9.651e+02, threshold=5.966e+02, percent-clipped=4.0 2023-06-23 23:34:52,613 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=973878.0, ans=0.125 2023-06-23 23:35:07,684 INFO [train.py:996] (3/4) Epoch 6, batch 9850, loss[loss=0.1934, simple_loss=0.2592, pruned_loss=0.06376, over 21791.00 frames. ], tot_loss[loss=0.2279, simple_loss=0.3008, pruned_loss=0.0775, over 4276571.06 frames. 
], batch size: 102, lr: 5.18e-03, grad_scale: 16.0 2023-06-23 23:35:08,123 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=973938.0, ans=0.0 2023-06-23 23:35:32,951 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=973998.0, ans=0.125 2023-06-23 23:35:41,268 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-23 23:36:29,049 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=974118.0, ans=0.125 2023-06-23 23:36:32,434 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=974178.0, ans=0.0 2023-06-23 23:36:57,016 INFO [train.py:996] (3/4) Epoch 6, batch 9900, loss[loss=0.2134, simple_loss=0.2897, pruned_loss=0.06853, over 21330.00 frames. ], tot_loss[loss=0.2256, simple_loss=0.2972, pruned_loss=0.07696, over 4261877.68 frames. ], batch size: 131, lr: 5.18e-03, grad_scale: 16.0 2023-06-23 23:37:02,833 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=974238.0, ans=0.125 2023-06-23 23:37:03,357 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.94 vs. limit=15.0 2023-06-23 23:37:15,570 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=974238.0, ans=0.125 2023-06-23 23:37:20,817 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=974298.0, ans=0.0 2023-06-23 23:37:24,295 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=974298.0, ans=0.0 2023-06-23 23:38:13,116 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=6.08 vs. limit=15.0 2023-06-23 23:38:14,116 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=974418.0, ans=0.2 2023-06-23 23:38:23,768 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.968e+02 2.567e+02 2.955e+02 3.451e+02 4.751e+02, threshold=5.911e+02, percent-clipped=0.0 2023-06-23 23:38:46,962 INFO [train.py:996] (3/4) Epoch 6, batch 9950, loss[loss=0.2058, simple_loss=0.2669, pruned_loss=0.07239, over 21578.00 frames. ], tot_loss[loss=0.2287, simple_loss=0.2991, pruned_loss=0.0792, over 4267759.56 frames. ], batch size: 263, lr: 5.18e-03, grad_scale: 16.0 2023-06-23 23:38:55,184 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=974538.0, ans=0.125 2023-06-23 23:38:57,308 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.75 vs. limit=15.0 2023-06-23 23:39:07,518 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=974538.0, ans=0.015 2023-06-23 23:39:10,115 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=12.28 vs. 
limit=15.0 2023-06-23 23:39:41,706 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=974658.0, ans=0.2 2023-06-23 23:39:46,992 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=974658.0, ans=0.0 2023-06-23 23:39:52,417 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=974658.0, ans=0.1 2023-06-23 23:40:30,925 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=974778.0, ans=0.1 2023-06-23 23:40:43,752 INFO [train.py:996] (3/4) Epoch 6, batch 10000, loss[loss=0.2129, simple_loss=0.2795, pruned_loss=0.07316, over 21164.00 frames. ], tot_loss[loss=0.2247, simple_loss=0.2941, pruned_loss=0.07764, over 4266925.47 frames. ], batch size: 143, lr: 5.18e-03, grad_scale: 32.0 2023-06-23 23:40:51,580 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=974838.0, ans=0.125 2023-06-23 23:41:50,727 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=975018.0, ans=0.0 2023-06-23 23:42:02,755 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=975018.0, ans=0.125 2023-06-23 23:42:10,928 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.070e+02 2.477e+02 2.945e+02 3.555e+02 6.332e+02, threshold=5.891e+02, percent-clipped=1.0 2023-06-23 23:42:28,235 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=975078.0, ans=0.125 2023-06-23 23:42:34,449 INFO [train.py:996] (3/4) Epoch 6, batch 10050, loss[loss=0.1942, simple_loss=0.2667, pruned_loss=0.06086, over 21399.00 frames. ], tot_loss[loss=0.2258, simple_loss=0.2958, pruned_loss=0.07797, over 4266766.05 frames. ], batch size: 194, lr: 5.18e-03, grad_scale: 32.0 2023-06-23 23:42:57,052 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=6.55 vs. limit=15.0 2023-06-23 23:43:00,266 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=975198.0, ans=0.0 2023-06-23 23:43:31,455 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.90 vs. limit=15.0 2023-06-23 23:43:49,148 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=975318.0, ans=0.125 2023-06-23 23:44:25,428 INFO [train.py:996] (3/4) Epoch 6, batch 10100, loss[loss=0.2294, simple_loss=0.3094, pruned_loss=0.0747, over 21759.00 frames. ], tot_loss[loss=0.2235, simple_loss=0.2945, pruned_loss=0.07625, over 4265418.22 frames. ], batch size: 351, lr: 5.18e-03, grad_scale: 32.0 2023-06-23 23:44:55,877 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.43 vs. 
limit=10.0 2023-06-23 23:45:12,467 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=975558.0, ans=0.125 2023-06-23 23:45:14,420 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=975558.0, ans=0.1 2023-06-23 23:45:17,757 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=975558.0, ans=0.1 2023-06-23 23:45:59,925 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.054e+02 2.533e+02 2.969e+02 3.783e+02 6.881e+02, threshold=5.937e+02, percent-clipped=1.0 2023-06-23 23:46:21,404 INFO [train.py:996] (3/4) Epoch 6, batch 10150, loss[loss=0.2519, simple_loss=0.3207, pruned_loss=0.09151, over 21513.00 frames. ], tot_loss[loss=0.2285, simple_loss=0.3005, pruned_loss=0.07824, over 4265737.38 frames. ], batch size: 389, lr: 5.18e-03, grad_scale: 16.0 2023-06-23 23:47:06,315 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=975858.0, ans=0.125 2023-06-23 23:47:21,077 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten.whitening_limit, batch_count=975858.0, ans=15.0 2023-06-23 23:47:36,867 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys.whitening_limit, batch_count=975918.0, ans=6.0 2023-06-23 23:48:09,605 INFO [train.py:996] (3/4) Epoch 6, batch 10200, loss[loss=0.2125, simple_loss=0.3002, pruned_loss=0.06243, over 21839.00 frames. ], tot_loss[loss=0.2259, simple_loss=0.2993, pruned_loss=0.0762, over 4256967.47 frames. ], batch size: 317, lr: 5.18e-03, grad_scale: 16.0 2023-06-23 23:48:35,910 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=976098.0, ans=0.1 2023-06-23 23:48:44,176 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=976098.0, ans=0.125 2023-06-23 23:48:58,024 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=976158.0, ans=0.125 2023-06-23 23:49:01,891 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.06 vs. limit=15.0 2023-06-23 23:49:38,134 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.688e+02 2.173e+02 2.583e+02 3.025e+02 4.269e+02, threshold=5.166e+02, percent-clipped=0.0 2023-06-23 23:49:56,470 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=976278.0, ans=0.04949747468305833 2023-06-23 23:49:59,505 INFO [train.py:996] (3/4) Epoch 6, batch 10250, loss[loss=0.1878, simple_loss=0.2577, pruned_loss=0.05894, over 21808.00 frames. ], tot_loss[loss=0.2176, simple_loss=0.2944, pruned_loss=0.07043, over 4258751.65 frames. ], batch size: 102, lr: 5.18e-03, grad_scale: 16.0 2023-06-23 23:50:20,242 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=12.86 vs. 
limit=22.5 2023-06-23 23:51:32,368 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=976578.0, ans=0.125 2023-06-23 23:51:58,302 INFO [train.py:996] (3/4) Epoch 6, batch 10300, loss[loss=0.2462, simple_loss=0.3394, pruned_loss=0.07651, over 21892.00 frames. ], tot_loss[loss=0.2215, simple_loss=0.2988, pruned_loss=0.07214, over 4265328.30 frames. ], batch size: 372, lr: 5.18e-03, grad_scale: 16.0 2023-06-23 23:52:06,277 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=976638.0, ans=0.125 2023-06-23 23:52:08,262 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-23 23:52:22,828 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=976698.0, ans=0.125 2023-06-23 23:52:24,438 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=976698.0, ans=0.0 2023-06-23 23:52:35,273 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=976698.0, ans=0.125 2023-06-23 23:53:25,814 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=976878.0, ans=0.0 2023-06-23 23:53:28,845 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.542e+02 2.521e+02 2.843e+02 3.478e+02 5.751e+02, threshold=5.686e+02, percent-clipped=3.0 2023-06-23 23:53:52,270 INFO [train.py:996] (3/4) Epoch 6, batch 10350, loss[loss=0.1899, simple_loss=0.2693, pruned_loss=0.05527, over 21658.00 frames. ], tot_loss[loss=0.222, simple_loss=0.2994, pruned_loss=0.07227, over 4263065.89 frames. ], batch size: 263, lr: 5.17e-03, grad_scale: 16.0 2023-06-23 23:53:55,361 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.83 vs. limit=12.0 2023-06-23 23:54:40,567 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=977058.0, ans=0.0 2023-06-23 23:55:23,742 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.75 vs. limit=22.5 2023-06-23 23:55:27,550 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=8.67 vs. limit=15.0 2023-06-23 23:55:43,943 INFO [train.py:996] (3/4) Epoch 6, batch 10400, loss[loss=0.1799, simple_loss=0.251, pruned_loss=0.05443, over 21630.00 frames. ], tot_loss[loss=0.2186, simple_loss=0.2949, pruned_loss=0.07115, over 4254235.55 frames. ], batch size: 263, lr: 5.17e-03, grad_scale: 32.0 2023-06-23 23:56:04,972 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.87 vs. 
limit=15.0 2023-06-23 23:56:11,517 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=977298.0, ans=0.0 2023-06-23 23:56:24,919 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=977298.0, ans=0.125 2023-06-23 23:56:43,024 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.56 vs. limit=15.0 2023-06-23 23:57:20,292 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.029e+02 2.786e+02 3.233e+02 3.708e+02 5.830e+02, threshold=6.465e+02, percent-clipped=3.0 2023-06-23 23:57:22,935 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer_ff2.min_abs, batch_count=977478.0, ans=0.1 2023-06-23 23:57:23,411 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.93 vs. limit=15.0 2023-06-23 23:57:41,114 INFO [train.py:996] (3/4) Epoch 6, batch 10450, loss[loss=0.2003, simple_loss=0.2709, pruned_loss=0.06485, over 16985.00 frames. ], tot_loss[loss=0.2232, simple_loss=0.2987, pruned_loss=0.07384, over 4257367.82 frames. ], batch size: 61, lr: 5.17e-03, grad_scale: 32.0 2023-06-23 23:58:04,207 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=11.11 vs. limit=15.0 2023-06-23 23:58:10,748 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=977598.0, ans=0.0 2023-06-23 23:58:52,106 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.69 vs. limit=15.0 2023-06-23 23:58:53,827 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=977718.0, ans=0.125 2023-06-23 23:59:02,254 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=977718.0, ans=0.125 2023-06-23 23:59:05,569 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=977718.0, ans=0.0 2023-06-23 23:59:12,985 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.50 vs. limit=15.0 2023-06-23 23:59:30,733 INFO [train.py:996] (3/4) Epoch 6, batch 10500, loss[loss=0.2117, simple_loss=0.2811, pruned_loss=0.07117, over 21173.00 frames. ], tot_loss[loss=0.2218, simple_loss=0.2982, pruned_loss=0.07268, over 4253405.12 frames. 
], batch size: 548, lr: 5.17e-03, grad_scale: 16.0 2023-06-24 00:00:04,952 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=977898.0, ans=0.2 2023-06-24 00:00:27,217 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=977958.0, ans=0.1 2023-06-24 00:00:45,050 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=978018.0, ans=0.125 2023-06-24 00:00:59,799 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.841e+02 2.398e+02 2.689e+02 3.123e+02 4.066e+02, threshold=5.379e+02, percent-clipped=0.0 2023-06-24 00:01:19,034 INFO [train.py:996] (3/4) Epoch 6, batch 10550, loss[loss=0.2147, simple_loss=0.271, pruned_loss=0.07917, over 21335.00 frames. ], tot_loss[loss=0.2192, simple_loss=0.2929, pruned_loss=0.07274, over 4241833.36 frames. ], batch size: 473, lr: 5.17e-03, grad_scale: 16.0 2023-06-24 00:02:04,321 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=978258.0, ans=0.125 2023-06-24 00:03:09,231 INFO [train.py:996] (3/4) Epoch 6, batch 10600, loss[loss=0.2262, simple_loss=0.3213, pruned_loss=0.06558, over 19903.00 frames. ], tot_loss[loss=0.2154, simple_loss=0.2881, pruned_loss=0.07141, over 4246136.16 frames. ], batch size: 703, lr: 5.17e-03, grad_scale: 16.0 2023-06-24 00:03:23,668 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.07 vs. limit=22.5 2023-06-24 00:04:19,691 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=978618.0, ans=0.04949747468305833 2023-06-24 00:04:30,032 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=978618.0, ans=0.125 2023-06-24 00:04:40,921 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=978618.0, ans=0.125 2023-06-24 00:04:47,264 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.753e+02 2.546e+02 2.981e+02 3.597e+02 7.487e+02, threshold=5.961e+02, percent-clipped=2.0 2023-06-24 00:04:59,850 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.91 vs. limit=15.0 2023-06-24 00:05:12,643 INFO [train.py:996] (3/4) Epoch 6, batch 10650, loss[loss=0.1518, simple_loss=0.2225, pruned_loss=0.04056, over 21214.00 frames. ], tot_loss[loss=0.2154, simple_loss=0.2911, pruned_loss=0.06984, over 4245564.32 frames. ], batch size: 159, lr: 5.17e-03, grad_scale: 16.0 2023-06-24 00:05:22,184 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=978738.0, ans=0.125 2023-06-24 00:06:04,002 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=978858.0, ans=0.5 2023-06-24 00:06:20,077 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=978918.0, ans=0.125 2023-06-24 00:06:20,527 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.03 vs. 
limit=15.0 2023-06-24 00:06:31,257 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.33 vs. limit=15.0 2023-06-24 00:07:03,068 INFO [train.py:996] (3/4) Epoch 6, batch 10700, loss[loss=0.2501, simple_loss=0.3249, pruned_loss=0.0877, over 21911.00 frames. ], tot_loss[loss=0.2149, simple_loss=0.2901, pruned_loss=0.0698, over 4244761.19 frames. ], batch size: 372, lr: 5.17e-03, grad_scale: 16.0 2023-06-24 00:07:21,533 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=979098.0, ans=0.1 2023-06-24 00:07:35,921 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=979098.0, ans=0.1 2023-06-24 00:07:44,689 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=979158.0, ans=0.125 2023-06-24 00:08:08,588 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=979218.0, ans=0.0 2023-06-24 00:08:35,822 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.934e+02 2.562e+02 2.930e+02 3.343e+02 5.418e+02, threshold=5.860e+02, percent-clipped=0.0 2023-06-24 00:08:55,529 INFO [train.py:996] (3/4) Epoch 6, batch 10750, loss[loss=0.2426, simple_loss=0.3318, pruned_loss=0.07666, over 21795.00 frames. ], tot_loss[loss=0.2255, simple_loss=0.3017, pruned_loss=0.07467, over 4255673.58 frames. ], batch size: 282, lr: 5.17e-03, grad_scale: 16.0 2023-06-24 00:09:19,408 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=979398.0, ans=0.05 2023-06-24 00:09:39,186 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=979458.0, ans=0.125 2023-06-24 00:09:53,605 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=979458.0, ans=0.04949747468305833 2023-06-24 00:10:42,972 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=979578.0, ans=0.2 2023-06-24 00:10:47,828 INFO [train.py:996] (3/4) Epoch 6, batch 10800, loss[loss=0.2369, simple_loss=0.3147, pruned_loss=0.0795, over 21822.00 frames. ], tot_loss[loss=0.2285, simple_loss=0.3065, pruned_loss=0.07528, over 4260278.19 frames. ], batch size: 282, lr: 5.17e-03, grad_scale: 32.0 2023-06-24 00:10:50,594 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=979638.0, ans=0.125 2023-06-24 00:11:28,538 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=979698.0, ans=0.0 2023-06-24 00:11:49,294 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.57 vs. 
limit=22.5 2023-06-24 00:12:01,279 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=979818.0, ans=0.1 2023-06-24 00:12:24,836 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.018e+02 2.761e+02 3.249e+02 3.882e+02 5.958e+02, threshold=6.498e+02, percent-clipped=1.0 2023-06-24 00:12:31,317 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.35 vs. limit=15.0 2023-06-24 00:12:37,606 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=979938.0, ans=0.0 2023-06-24 00:12:44,079 INFO [train.py:996] (3/4) Epoch 6, batch 10850, loss[loss=0.2608, simple_loss=0.3114, pruned_loss=0.1051, over 21486.00 frames. ], tot_loss[loss=0.2306, simple_loss=0.3086, pruned_loss=0.07628, over 4260961.01 frames. ], batch size: 509, lr: 5.17e-03, grad_scale: 32.0 2023-06-24 00:12:47,964 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=979938.0, ans=0.125 2023-06-24 00:12:55,109 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=979938.0, ans=0.1 2023-06-24 00:13:17,057 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=979998.0, ans=0.125 2023-06-24 00:13:33,772 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=980058.0, ans=0.0 2023-06-24 00:14:11,052 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=980178.0, ans=0.125 2023-06-24 00:14:19,538 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=980178.0, ans=0.1 2023-06-24 00:14:27,271 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=980178.0, ans=0.125 2023-06-24 00:14:30,869 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=980178.0, ans=0.125 2023-06-24 00:14:35,108 INFO [train.py:996] (3/4) Epoch 6, batch 10900, loss[loss=0.2134, simple_loss=0.2878, pruned_loss=0.06952, over 21730.00 frames. ], tot_loss[loss=0.2245, simple_loss=0.3007, pruned_loss=0.07414, over 4246008.30 frames. 
], batch size: 316, lr: 5.17e-03, grad_scale: 16.0 2023-06-24 00:15:06,289 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=980298.0, ans=0.125 2023-06-24 00:15:41,938 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=980418.0, ans=0.1 2023-06-24 00:15:57,392 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=980418.0, ans=0.125 2023-06-24 00:16:02,915 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=980478.0, ans=0.125 2023-06-24 00:16:05,803 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.122e+02 2.411e+02 2.776e+02 2.994e+02 5.292e+02, threshold=5.553e+02, percent-clipped=0.0 2023-06-24 00:16:20,647 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.36 vs. limit=12.0 2023-06-24 00:16:22,915 INFO [train.py:996] (3/4) Epoch 6, batch 10950, loss[loss=0.2036, simple_loss=0.2701, pruned_loss=0.06858, over 21473.00 frames. ], tot_loss[loss=0.2201, simple_loss=0.2956, pruned_loss=0.07233, over 4247742.86 frames. ], batch size: 389, lr: 5.16e-03, grad_scale: 16.0 2023-06-24 00:17:02,326 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=980598.0, ans=0.1 2023-06-24 00:17:06,383 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.02 vs. limit=12.0 2023-06-24 00:18:13,133 INFO [train.py:996] (3/4) Epoch 6, batch 11000, loss[loss=0.2662, simple_loss=0.3395, pruned_loss=0.09642, over 21732.00 frames. ], tot_loss[loss=0.2208, simple_loss=0.2946, pruned_loss=0.07352, over 4254932.68 frames. ], batch size: 112, lr: 5.16e-03, grad_scale: 16.0 2023-06-24 00:18:39,132 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=12.24 vs. limit=15.0 2023-06-24 00:19:41,498 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.00 vs. limit=6.0 2023-06-24 00:19:44,308 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=981078.0, ans=0.0 2023-06-24 00:19:45,482 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.830e+02 2.423e+02 2.754e+02 3.301e+02 6.173e+02, threshold=5.508e+02, percent-clipped=2.0 2023-06-24 00:19:57,266 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=981138.0, ans=0.0 2023-06-24 00:19:58,267 INFO [train.py:996] (3/4) Epoch 6, batch 11050, loss[loss=0.2246, simple_loss=0.2858, pruned_loss=0.08172, over 21858.00 frames. ], tot_loss[loss=0.2213, simple_loss=0.2939, pruned_loss=0.07438, over 4263101.23 frames. 
], batch size: 98, lr: 5.16e-03, grad_scale: 16.0 2023-06-24 00:20:24,175 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=981138.0, ans=0.0 2023-06-24 00:20:33,204 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=981198.0, ans=0.125 2023-06-24 00:20:51,694 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=8.80 vs. limit=15.0 2023-06-24 00:21:38,462 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=981378.0, ans=0.0 2023-06-24 00:21:45,978 INFO [train.py:996] (3/4) Epoch 6, batch 11100, loss[loss=0.2498, simple_loss=0.3112, pruned_loss=0.09424, over 21291.00 frames. ], tot_loss[loss=0.2202, simple_loss=0.2918, pruned_loss=0.07429, over 4254130.58 frames. ], batch size: 471, lr: 5.16e-03, grad_scale: 16.0 2023-06-24 00:22:40,833 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=981558.0, ans=0.025 2023-06-24 00:22:57,386 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.03 vs. limit=10.0 2023-06-24 00:23:14,656 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.61 vs. limit=22.5 2023-06-24 00:23:23,904 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.037e+02 2.487e+02 2.801e+02 3.244e+02 5.802e+02, threshold=5.603e+02, percent-clipped=1.0 2023-06-24 00:23:36,114 INFO [train.py:996] (3/4) Epoch 6, batch 11150, loss[loss=0.2066, simple_loss=0.2718, pruned_loss=0.07069, over 21882.00 frames. ], tot_loss[loss=0.2184, simple_loss=0.289, pruned_loss=0.07388, over 4251111.27 frames. ], batch size: 107, lr: 5.16e-03, grad_scale: 16.0 2023-06-24 00:23:37,360 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.87 vs. limit=6.0 2023-06-24 00:24:05,369 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=981798.0, ans=0.125 2023-06-24 00:24:15,110 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=981798.0, ans=0.125 2023-06-24 00:24:21,986 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=981798.0, ans=0.125 2023-06-24 00:24:55,160 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.05 vs. limit=15.0 2023-06-24 00:25:27,196 INFO [train.py:996] (3/4) Epoch 6, batch 11200, loss[loss=0.2198, simple_loss=0.2781, pruned_loss=0.08071, over 21759.00 frames. ], tot_loss[loss=0.2175, simple_loss=0.288, pruned_loss=0.07351, over 4250829.83 frames. 
], batch size: 317, lr: 5.16e-03, grad_scale: 32.0 2023-06-24 00:25:51,143 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=982038.0, ans=0.125 2023-06-24 00:26:26,366 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=982158.0, ans=0.0 2023-06-24 00:27:00,498 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=982278.0, ans=0.1 2023-06-24 00:27:03,168 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.056e+02 2.434e+02 2.676e+02 2.972e+02 5.122e+02, threshold=5.353e+02, percent-clipped=0.0 2023-06-24 00:27:15,147 INFO [train.py:996] (3/4) Epoch 6, batch 11250, loss[loss=0.2453, simple_loss=0.3223, pruned_loss=0.08416, over 21782.00 frames. ], tot_loss[loss=0.2169, simple_loss=0.2876, pruned_loss=0.07313, over 4252833.81 frames. ], batch size: 118, lr: 5.16e-03, grad_scale: 32.0 2023-06-24 00:27:22,403 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=982338.0, ans=0.015 2023-06-24 00:27:23,091 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.88 vs. limit=22.5 2023-06-24 00:27:23,129 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.80 vs. limit=12.0 2023-06-24 00:29:03,621 INFO [train.py:996] (3/4) Epoch 6, batch 11300, loss[loss=0.1996, simple_loss=0.2694, pruned_loss=0.06486, over 20808.00 frames. ], tot_loss[loss=0.2182, simple_loss=0.2892, pruned_loss=0.07362, over 4257652.90 frames. ], batch size: 609, lr: 5.16e-03, grad_scale: 16.0 2023-06-24 00:29:55,785 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=982758.0, ans=0.125 2023-06-24 00:30:23,088 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=982818.0, ans=0.125 2023-06-24 00:30:42,912 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=982878.0, ans=0.2 2023-06-24 00:30:43,981 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.722e+02 2.481e+02 2.716e+02 3.096e+02 3.979e+02, threshold=5.433e+02, percent-clipped=0.0 2023-06-24 00:31:00,998 INFO [train.py:996] (3/4) Epoch 6, batch 11350, loss[loss=0.2008, simple_loss=0.3075, pruned_loss=0.04705, over 20759.00 frames. ], tot_loss[loss=0.2168, simple_loss=0.2893, pruned_loss=0.0722, over 4263630.79 frames. ], batch size: 607, lr: 5.16e-03, grad_scale: 16.0 2023-06-24 00:31:30,719 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=982998.0, ans=0.125 2023-06-24 00:32:36,658 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=983178.0, ans=0.125 2023-06-24 00:32:59,512 INFO [train.py:996] (3/4) Epoch 6, batch 11400, loss[loss=0.2219, simple_loss=0.3103, pruned_loss=0.06676, over 21718.00 frames. ], tot_loss[loss=0.2216, simple_loss=0.2948, pruned_loss=0.07416, over 4260189.70 frames. 
], batch size: 298, lr: 5.16e-03, grad_scale: 16.0 2023-06-24 00:34:38,917 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.986e+02 2.559e+02 2.841e+02 3.332e+02 5.224e+02, threshold=5.682e+02, percent-clipped=0.0 2023-06-24 00:34:49,735 INFO [train.py:996] (3/4) Epoch 6, batch 11450, loss[loss=0.2396, simple_loss=0.3174, pruned_loss=0.08095, over 21699.00 frames. ], tot_loss[loss=0.221, simple_loss=0.2957, pruned_loss=0.07308, over 4260641.62 frames. ], batch size: 351, lr: 5.16e-03, grad_scale: 16.0 2023-06-24 00:35:02,799 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=983538.0, ans=0.125 2023-06-24 00:35:10,018 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=983538.0, ans=0.07 2023-06-24 00:35:36,425 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=983658.0, ans=0.125 2023-06-24 00:35:49,846 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.84 vs. limit=15.0 2023-06-24 00:36:15,096 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.61 vs. limit=15.0 2023-06-24 00:36:29,276 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=983778.0, ans=0.125 2023-06-24 00:36:46,013 INFO [train.py:996] (3/4) Epoch 6, batch 11500, loss[loss=0.2075, simple_loss=0.2993, pruned_loss=0.05786, over 21769.00 frames. ], tot_loss[loss=0.2254, simple_loss=0.3005, pruned_loss=0.07517, over 4264789.41 frames. ], batch size: 298, lr: 5.16e-03, grad_scale: 16.0 2023-06-24 00:38:29,586 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.054e+02 2.699e+02 3.055e+02 3.965e+02 5.631e+02, threshold=6.111e+02, percent-clipped=0.0 2023-06-24 00:38:32,308 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=984078.0, ans=0.125 2023-06-24 00:38:32,387 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=984078.0, ans=0.1 2023-06-24 00:38:41,210 INFO [train.py:996] (3/4) Epoch 6, batch 11550, loss[loss=0.2668, simple_loss=0.3643, pruned_loss=0.08462, over 21850.00 frames. ], tot_loss[loss=0.2298, simple_loss=0.3075, pruned_loss=0.07607, over 4268308.67 frames. ], batch size: 316, lr: 5.16e-03, grad_scale: 16.0 2023-06-24 00:39:12,811 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=984198.0, ans=0.125 2023-06-24 00:40:10,829 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=984318.0, ans=0.125 2023-06-24 00:40:38,810 INFO [train.py:996] (3/4) Epoch 6, batch 11600, loss[loss=0.2644, simple_loss=0.352, pruned_loss=0.08843, over 21389.00 frames. ], tot_loss[loss=0.239, simple_loss=0.322, pruned_loss=0.07797, over 4262561.22 frames. 
], batch size: 194, lr: 5.15e-03, grad_scale: 32.0 2023-06-24 00:40:39,988 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten.whitening_limit, batch_count=984438.0, ans=15.0 2023-06-24 00:40:50,629 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=9.02 vs. limit=15.0 2023-06-24 00:42:15,284 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.062e+02 2.879e+02 3.402e+02 4.224e+02 8.565e+02, threshold=6.804e+02, percent-clipped=5.0 2023-06-24 00:42:27,799 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=984738.0, ans=0.125 2023-06-24 00:42:28,799 INFO [train.py:996] (3/4) Epoch 6, batch 11650, loss[loss=0.2089, simple_loss=0.2823, pruned_loss=0.06774, over 21805.00 frames. ], tot_loss[loss=0.2419, simple_loss=0.327, pruned_loss=0.0784, over 4266496.31 frames. ], batch size: 124, lr: 5.15e-03, grad_scale: 16.0 2023-06-24 00:43:14,862 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=984858.0, ans=0.0 2023-06-24 00:44:12,081 INFO [train.py:996] (3/4) Epoch 6, batch 11700, loss[loss=0.2156, simple_loss=0.2814, pruned_loss=0.0749, over 21654.00 frames. ], tot_loss[loss=0.2374, simple_loss=0.3179, pruned_loss=0.0785, over 4271443.94 frames. ], batch size: 282, lr: 5.15e-03, grad_scale: 16.0 2023-06-24 00:44:21,322 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=985038.0, ans=0.125 2023-06-24 00:44:35,477 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=985098.0, ans=0.125 2023-06-24 00:44:56,497 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=985158.0, ans=0.0 2023-06-24 00:45:07,228 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=985158.0, ans=0.2 2023-06-24 00:45:23,351 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=13.74 vs. limit=22.5 2023-06-24 00:45:52,489 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.098e+02 2.525e+02 2.747e+02 3.370e+02 5.066e+02, threshold=5.494e+02, percent-clipped=0.0 2023-06-24 00:45:56,752 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=985278.0, ans=0.125 2023-06-24 00:46:01,421 INFO [train.py:996] (3/4) Epoch 6, batch 11750, loss[loss=0.234, simple_loss=0.3034, pruned_loss=0.08224, over 21881.00 frames. ], tot_loss[loss=0.2326, simple_loss=0.3098, pruned_loss=0.07771, over 4266609.62 frames. ], batch size: 372, lr: 5.15e-03, grad_scale: 16.0 2023-06-24 00:47:23,006 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=985518.0, ans=0.125 2023-06-24 00:47:40,704 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=985578.0, ans=0.125 2023-06-24 00:47:52,513 INFO [train.py:996] (3/4) Epoch 6, batch 11800, loss[loss=0.243, simple_loss=0.3212, pruned_loss=0.08243, over 19989.00 frames. ], tot_loss[loss=0.234, simple_loss=0.3105, pruned_loss=0.07881, over 4259546.18 frames. 
], batch size: 702, lr: 5.15e-03, grad_scale: 16.0 2023-06-24 00:49:34,920 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.881e+02 2.469e+02 2.710e+02 3.084e+02 4.949e+02, threshold=5.420e+02, percent-clipped=0.0 2023-06-24 00:49:43,735 INFO [train.py:996] (3/4) Epoch 6, batch 11850, loss[loss=0.2347, simple_loss=0.3018, pruned_loss=0.08379, over 21329.00 frames. ], tot_loss[loss=0.2336, simple_loss=0.3111, pruned_loss=0.07811, over 4261863.59 frames. ], batch size: 176, lr: 5.15e-03, grad_scale: 16.0 2023-06-24 00:49:47,901 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=985938.0, ans=0.125 2023-06-24 00:49:51,694 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=985938.0, ans=0.0 2023-06-24 00:50:33,030 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=985998.0, ans=0.04949747468305833 2023-06-24 00:51:11,622 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=986118.0, ans=0.1 2023-06-24 00:51:34,324 INFO [train.py:996] (3/4) Epoch 6, batch 11900, loss[loss=0.2161, simple_loss=0.3069, pruned_loss=0.06259, over 21835.00 frames. ], tot_loss[loss=0.2307, simple_loss=0.3101, pruned_loss=0.07563, over 4268478.64 frames. ], batch size: 371, lr: 5.15e-03, grad_scale: 16.0 2023-06-24 00:52:16,411 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=986298.0, ans=0.1 2023-06-24 00:53:16,514 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.796e+02 2.327e+02 2.667e+02 3.121e+02 4.121e+02, threshold=5.333e+02, percent-clipped=0.0 2023-06-24 00:53:31,113 INFO [train.py:996] (3/4) Epoch 6, batch 11950, loss[loss=0.1705, simple_loss=0.242, pruned_loss=0.04952, over 21194.00 frames. ], tot_loss[loss=0.2276, simple_loss=0.31, pruned_loss=0.07255, over 4270913.89 frames. ], batch size: 143, lr: 5.15e-03, grad_scale: 16.0 2023-06-24 00:53:58,034 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=986598.0, ans=0.125 2023-06-24 00:55:19,958 INFO [train.py:996] (3/4) Epoch 6, batch 12000, loss[loss=0.1919, simple_loss=0.2606, pruned_loss=0.06159, over 15609.00 frames. ], tot_loss[loss=0.2241, simple_loss=0.3054, pruned_loss=0.07139, over 4262543.46 frames. ], batch size: 61, lr: 5.15e-03, grad_scale: 32.0 2023-06-24 00:55:19,959 INFO [train.py:1019] (3/4) Computing validation loss 2023-06-24 00:55:44,724 INFO [train.py:1028] (3/4) Epoch 6, validation: loss=0.2624, simple_loss=0.3526, pruned_loss=0.08607, over 1796401.00 frames. 
2023-06-24 00:55:44,725 INFO [train.py:1029] (3/4) Maximum memory allocated so far is 23336MB 2023-06-24 00:55:57,780 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=986838.0, ans=0.125 2023-06-24 00:55:59,176 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-24 00:55:59,223 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=986838.0, ans=0.2 2023-06-24 00:56:48,684 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.60 vs. limit=22.5 2023-06-24 00:56:57,207 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=987078.0, ans=0.125 2023-06-24 00:57:01,504 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=5.75 vs. limit=10.0 2023-06-24 00:57:04,109 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=987078.0, ans=0.125 2023-06-24 00:57:13,628 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.670e+02 2.572e+02 3.062e+02 3.583e+02 6.186e+02, threshold=6.124e+02, percent-clipped=1.0 2023-06-24 00:57:27,290 INFO [train.py:996] (3/4) Epoch 6, batch 12050, loss[loss=0.2225, simple_loss=0.2876, pruned_loss=0.07869, over 21366.00 frames. ], tot_loss[loss=0.2256, simple_loss=0.3032, pruned_loss=0.07398, over 4263241.53 frames. ], batch size: 143, lr: 5.15e-03, grad_scale: 32.0 2023-06-24 00:57:38,923 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=987138.0, ans=0.125 2023-06-24 00:59:24,247 INFO [train.py:996] (3/4) Epoch 6, batch 12100, loss[loss=0.2453, simple_loss=0.3298, pruned_loss=0.08042, over 21864.00 frames. ], tot_loss[loss=0.2307, simple_loss=0.3068, pruned_loss=0.07729, over 4269681.75 frames. 
], batch size: 371, lr: 5.15e-03, grad_scale: 16.0 2023-06-24 00:59:39,864 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=987438.0, ans=0.0 2023-06-24 00:59:45,827 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=987498.0, ans=0.125 2023-06-24 00:59:54,495 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=987498.0, ans=0.035 2023-06-24 01:00:11,768 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.whiten.whitening_limit, batch_count=987558.0, ans=12.0 2023-06-24 01:00:20,512 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=987558.0, ans=0.1 2023-06-24 01:00:49,722 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=987618.0, ans=0.125 2023-06-24 01:01:03,958 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=987678.0, ans=0.125 2023-06-24 01:01:05,942 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=987678.0, ans=0.125 2023-06-24 01:01:06,900 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.880e+02 2.682e+02 3.113e+02 3.706e+02 5.999e+02, threshold=6.227e+02, percent-clipped=0.0 2023-06-24 01:01:14,018 INFO [train.py:996] (3/4) Epoch 6, batch 12150, loss[loss=0.2689, simple_loss=0.3842, pruned_loss=0.07683, over 19720.00 frames. ], tot_loss[loss=0.2323, simple_loss=0.3109, pruned_loss=0.07689, over 4262738.41 frames. ], batch size: 702, lr: 5.15e-03, grad_scale: 16.0 2023-06-24 01:01:45,030 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=987798.0, ans=0.1 2023-06-24 01:01:48,636 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=987798.0, ans=0.0 2023-06-24 01:01:51,007 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.20 vs. limit=15.0 2023-06-24 01:01:55,664 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=987858.0, ans=0.125 2023-06-24 01:02:50,223 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-24 01:03:08,579 INFO [train.py:996] (3/4) Epoch 6, batch 12200, loss[loss=0.2303, simple_loss=0.2878, pruned_loss=0.08636, over 21845.00 frames. ], tot_loss[loss=0.2297, simple_loss=0.3073, pruned_loss=0.07601, over 4260066.26 frames. ], batch size: 98, lr: 5.15e-03, grad_scale: 16.0 2023-06-24 01:03:25,624 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=988098.0, ans=0.0 2023-06-24 01:03:37,493 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=988098.0, ans=0.0 2023-06-24 01:04:39,231 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=15.33 vs. 
limit=22.5 2023-06-24 01:04:45,495 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.529e+02 2.375e+02 2.667e+02 3.386e+02 5.475e+02, threshold=5.334e+02, percent-clipped=0.0 2023-06-24 01:04:54,362 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=988278.0, ans=0.035 2023-06-24 01:04:57,296 INFO [train.py:996] (3/4) Epoch 6, batch 12250, loss[loss=0.1612, simple_loss=0.2361, pruned_loss=0.04312, over 21527.00 frames. ], tot_loss[loss=0.2214, simple_loss=0.2982, pruned_loss=0.07232, over 4265061.77 frames. ], batch size: 195, lr: 5.14e-03, grad_scale: 16.0 2023-06-24 01:04:57,857 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=988338.0, ans=0.125 2023-06-24 01:05:01,897 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.84 vs. limit=15.0 2023-06-24 01:05:08,561 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.56 vs. limit=6.0 2023-06-24 01:05:09,873 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=988338.0, ans=0.125 2023-06-24 01:05:14,795 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=988398.0, ans=0.0 2023-06-24 01:05:34,848 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.70 vs. limit=12.0 2023-06-24 01:06:19,519 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.64 vs. limit=15.0 2023-06-24 01:06:40,991 INFO [train.py:996] (3/4) Epoch 6, batch 12300, loss[loss=0.249, simple_loss=0.3385, pruned_loss=0.07972, over 21844.00 frames. ], tot_loss[loss=0.2127, simple_loss=0.2915, pruned_loss=0.06701, over 4251307.07 frames. ], batch size: 371, lr: 5.14e-03, grad_scale: 8.0 2023-06-24 01:06:57,089 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=988638.0, ans=0.125 2023-06-24 01:07:23,280 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=988758.0, ans=0.035 2023-06-24 01:08:25,510 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.495e+02 2.150e+02 2.660e+02 3.179e+02 5.593e+02, threshold=5.319e+02, percent-clipped=1.0 2023-06-24 01:08:32,792 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=988878.0, ans=10.0 2023-06-24 01:08:36,004 INFO [train.py:996] (3/4) Epoch 6, batch 12350, loss[loss=0.2397, simple_loss=0.3134, pruned_loss=0.08297, over 21560.00 frames. ], tot_loss[loss=0.2166, simple_loss=0.2975, pruned_loss=0.0679, over 4254539.24 frames. 
], batch size: 548, lr: 5.14e-03, grad_scale: 8.0 2023-06-24 01:09:06,142 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=988998.0, ans=0.0 2023-06-24 01:09:14,893 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=989058.0, ans=0.1 2023-06-24 01:09:41,783 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=989058.0, ans=0.125 2023-06-24 01:10:05,299 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=4.87 vs. limit=15.0 2023-06-24 01:10:19,889 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=989178.0, ans=0.125 2023-06-24 01:10:24,593 INFO [train.py:996] (3/4) Epoch 6, batch 12400, loss[loss=0.2374, simple_loss=0.3001, pruned_loss=0.08733, over 21344.00 frames. ], tot_loss[loss=0.2212, simple_loss=0.2992, pruned_loss=0.07162, over 4263284.52 frames. ], batch size: 176, lr: 5.14e-03, grad_scale: 16.0 2023-06-24 01:10:53,224 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=989298.0, ans=0.0 2023-06-24 01:11:26,103 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-24 01:11:34,409 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=989418.0, ans=0.125 2023-06-24 01:12:08,940 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.082e+02 2.631e+02 2.949e+02 3.533e+02 4.721e+02, threshold=5.899e+02, percent-clipped=0.0 2023-06-24 01:12:14,224 INFO [train.py:996] (3/4) Epoch 6, batch 12450, loss[loss=0.2606, simple_loss=0.3372, pruned_loss=0.09201, over 21481.00 frames. ], tot_loss[loss=0.2266, simple_loss=0.303, pruned_loss=0.07514, over 4272044.50 frames. ], batch size: 131, lr: 5.14e-03, grad_scale: 16.0 2023-06-24 01:12:56,096 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.96 vs. limit=15.0 2023-06-24 01:13:24,117 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=989658.0, ans=0.125 2023-06-24 01:13:26,062 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=989718.0, ans=0.0 2023-06-24 01:13:26,610 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.62 vs. limit=15.0 2023-06-24 01:14:00,026 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-24 01:14:08,605 INFO [train.py:996] (3/4) Epoch 6, batch 12500, loss[loss=0.253, simple_loss=0.36, pruned_loss=0.07298, over 21922.00 frames. ], tot_loss[loss=0.2341, simple_loss=0.3126, pruned_loss=0.07782, over 4269284.44 frames. 
], batch size: 317, lr: 5.14e-03, grad_scale: 16.0 2023-06-24 01:14:49,135 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=989898.0, ans=0.125 2023-06-24 01:14:56,397 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=989898.0, ans=0.125 2023-06-24 01:14:58,743 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.26 vs. limit=6.0 2023-06-24 01:15:01,759 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=989958.0, ans=0.0 2023-06-24 01:16:01,942 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.301e+02 2.735e+02 3.011e+02 3.446e+02 4.823e+02, threshold=6.021e+02, percent-clipped=0.0 2023-06-24 01:16:07,467 INFO [train.py:996] (3/4) Epoch 6, batch 12550, loss[loss=0.2552, simple_loss=0.3293, pruned_loss=0.0906, over 21818.00 frames. ], tot_loss[loss=0.2381, simple_loss=0.316, pruned_loss=0.08015, over 4274687.23 frames. ], batch size: 118, lr: 5.14e-03, grad_scale: 16.0 2023-06-24 01:16:15,077 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=990138.0, ans=0.125 2023-06-24 01:16:22,066 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=990138.0, ans=0.1 2023-06-24 01:16:24,476 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.34 vs. limit=6.0 2023-06-24 01:16:27,512 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=990138.0, ans=0.125 2023-06-24 01:16:29,988 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=15.30 vs. limit=15.0 2023-06-24 01:17:19,677 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=990318.0, ans=0.125 2023-06-24 01:18:03,115 INFO [train.py:996] (3/4) Epoch 6, batch 12600, loss[loss=0.204, simple_loss=0.2903, pruned_loss=0.05887, over 21595.00 frames. ], tot_loss[loss=0.2351, simple_loss=0.3142, pruned_loss=0.07798, over 4267914.44 frames. ], batch size: 230, lr: 5.14e-03, grad_scale: 16.0 2023-06-24 01:18:06,145 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.03 vs. limit=15.0 2023-06-24 01:18:18,557 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.64 vs. 
limit=15.0 2023-06-24 01:18:19,875 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=990498.0, ans=0.2 2023-06-24 01:18:31,590 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=990498.0, ans=0.1 2023-06-24 01:19:01,386 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=990618.0, ans=0.2 2023-06-24 01:19:46,574 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.745e+02 2.342e+02 2.712e+02 3.358e+02 5.513e+02, threshold=5.424e+02, percent-clipped=0.0 2023-06-24 01:19:51,711 INFO [train.py:996] (3/4) Epoch 6, batch 12650, loss[loss=0.2492, simple_loss=0.3652, pruned_loss=0.0666, over 20758.00 frames. ], tot_loss[loss=0.2264, simple_loss=0.3063, pruned_loss=0.07327, over 4274759.72 frames. ], batch size: 608, lr: 5.14e-03, grad_scale: 16.0 2023-06-24 01:20:31,815 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=990858.0, ans=0.125 2023-06-24 01:20:33,223 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-24 01:21:32,248 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-24 01:21:40,691 INFO [train.py:996] (3/4) Epoch 6, batch 12700, loss[loss=0.2381, simple_loss=0.315, pruned_loss=0.08058, over 21943.00 frames. ], tot_loss[loss=0.2283, simple_loss=0.3057, pruned_loss=0.07538, over 4280615.99 frames. ], batch size: 372, lr: 5.14e-03, grad_scale: 16.0 2023-06-24 01:22:14,236 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=991098.0, ans=0.125 2023-06-24 01:23:14,128 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=991278.0, ans=0.1 2023-06-24 01:23:25,526 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.805e+02 2.607e+02 2.938e+02 3.445e+02 5.217e+02, threshold=5.876e+02, percent-clipped=0.0 2023-06-24 01:23:26,126 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=991278.0, ans=0.125 2023-06-24 01:23:31,058 INFO [train.py:996] (3/4) Epoch 6, batch 12750, loss[loss=0.2184, simple_loss=0.2954, pruned_loss=0.07074, over 21772.00 frames. ], tot_loss[loss=0.2293, simple_loss=0.3072, pruned_loss=0.07565, over 4282596.15 frames. ], batch size: 298, lr: 5.14e-03, grad_scale: 16.0 2023-06-24 01:24:16,781 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=991458.0, ans=0.0 2023-06-24 01:24:17,412 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.24 vs. limit=22.5 2023-06-24 01:24:23,458 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=991458.0, ans=0.0 2023-06-24 01:25:02,813 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=5.79 vs. limit=10.0 2023-06-24 01:25:15,691 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.49 vs. 
limit=10.0 2023-06-24 01:25:19,771 INFO [train.py:996] (3/4) Epoch 6, batch 12800, loss[loss=0.2097, simple_loss=0.2836, pruned_loss=0.06788, over 21812.00 frames. ], tot_loss[loss=0.2298, simple_loss=0.3066, pruned_loss=0.07655, over 4287968.64 frames. ], batch size: 247, lr: 5.14e-03, grad_scale: 32.0 2023-06-24 01:25:58,254 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=991758.0, ans=0.125 2023-06-24 01:26:10,712 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=991758.0, ans=0.0 2023-06-24 01:26:11,189 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=14.11 vs. limit=15.0 2023-06-24 01:26:45,475 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=991818.0, ans=0.2 2023-06-24 01:26:45,556 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=991818.0, ans=0.1 2023-06-24 01:27:06,615 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.019e+02 2.498e+02 2.671e+02 3.042e+02 5.514e+02, threshold=5.341e+02, percent-clipped=0.0 2023-06-24 01:27:10,338 INFO [train.py:996] (3/4) Epoch 6, batch 12850, loss[loss=0.1952, simple_loss=0.2872, pruned_loss=0.05159, over 21735.00 frames. ], tot_loss[loss=0.2333, simple_loss=0.3094, pruned_loss=0.07856, over 4289073.38 frames. ], batch size: 247, lr: 5.14e-03, grad_scale: 16.0 2023-06-24 01:28:19,505 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=992058.0, ans=0.125 2023-06-24 01:28:43,500 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=992178.0, ans=0.0 2023-06-24 01:28:53,207 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten.whitening_limit, batch_count=992178.0, ans=15.0 2023-06-24 01:29:08,023 INFO [train.py:996] (3/4) Epoch 6, batch 12900, loss[loss=0.1942, simple_loss=0.2717, pruned_loss=0.05835, over 21412.00 frames. ], tot_loss[loss=0.2281, simple_loss=0.3067, pruned_loss=0.07477, over 4280283.14 frames. ], batch size: 194, lr: 5.13e-03, grad_scale: 16.0 2023-06-24 01:29:12,228 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=992238.0, ans=0.1 2023-06-24 01:29:20,938 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=992238.0, ans=0.125 2023-06-24 01:29:26,324 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=992298.0, ans=0.125 2023-06-24 01:30:55,026 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.691e+02 2.252e+02 2.502e+02 2.973e+02 5.465e+02, threshold=5.003e+02, percent-clipped=1.0 2023-06-24 01:30:58,568 INFO [train.py:996] (3/4) Epoch 6, batch 12950, loss[loss=0.2248, simple_loss=0.3022, pruned_loss=0.07374, over 21933.00 frames. ], tot_loss[loss=0.2253, simple_loss=0.3049, pruned_loss=0.07284, over 4275645.83 frames. 
], batch size: 317, lr: 5.13e-03, grad_scale: 16.0 2023-06-24 01:31:02,370 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=992538.0, ans=0.0 2023-06-24 01:31:54,610 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=992658.0, ans=0.95 2023-06-24 01:32:01,518 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=992658.0, ans=0.5 2023-06-24 01:32:15,573 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-24 01:32:47,509 INFO [train.py:996] (3/4) Epoch 6, batch 13000, loss[loss=0.2345, simple_loss=0.313, pruned_loss=0.07801, over 21621.00 frames. ], tot_loss[loss=0.2282, simple_loss=0.3072, pruned_loss=0.07462, over 4257777.83 frames. ], batch size: 441, lr: 5.13e-03, grad_scale: 16.0 2023-06-24 01:32:51,530 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=992838.0, ans=0.125 2023-06-24 01:32:54,784 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=992838.0, ans=0.0 2023-06-24 01:32:56,901 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=992838.0, ans=0.2 2023-06-24 01:34:00,341 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=993018.0, ans=0.2 2023-06-24 01:34:00,460 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=993018.0, ans=0.125 2023-06-24 01:34:14,982 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=993018.0, ans=0.0 2023-06-24 01:34:33,604 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.751e+02 2.511e+02 2.962e+02 3.599e+02 5.386e+02, threshold=5.923e+02, percent-clipped=1.0 2023-06-24 01:34:36,039 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-24 01:34:36,945 INFO [train.py:996] (3/4) Epoch 6, batch 13050, loss[loss=0.2245, simple_loss=0.2968, pruned_loss=0.0761, over 21872.00 frames. ], tot_loss[loss=0.2236, simple_loss=0.3026, pruned_loss=0.07232, over 4262212.31 frames. ], batch size: 371, lr: 5.13e-03, grad_scale: 16.0 2023-06-24 01:35:00,479 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=993198.0, ans=0.0 2023-06-24 01:35:59,130 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=993318.0, ans=0.0 2023-06-24 01:36:15,029 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=993378.0, ans=0.125 2023-06-24 01:36:21,535 INFO [train.py:996] (3/4) Epoch 6, batch 13100, loss[loss=0.2558, simple_loss=0.3293, pruned_loss=0.09114, over 21304.00 frames. ], tot_loss[loss=0.2257, simple_loss=0.306, pruned_loss=0.07275, over 4265020.27 frames. 
], batch size: 159, lr: 5.13e-03, grad_scale: 16.0 2023-06-24 01:36:34,154 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=993438.0, ans=0.1 2023-06-24 01:36:36,443 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=993438.0, ans=0.0 2023-06-24 01:37:13,797 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=993558.0, ans=0.125 2023-06-24 01:37:18,642 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=14.37 vs. limit=22.5 2023-06-24 01:37:35,474 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=993618.0, ans=0.125 2023-06-24 01:37:37,617 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=993618.0, ans=0.0 2023-06-24 01:38:09,180 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.779e+02 2.775e+02 3.249e+02 4.198e+02 6.182e+02, threshold=6.497e+02, percent-clipped=2.0 2023-06-24 01:38:18,939 INFO [train.py:996] (3/4) Epoch 6, batch 13150, loss[loss=0.1802, simple_loss=0.2623, pruned_loss=0.04907, over 21584.00 frames. ], tot_loss[loss=0.2292, simple_loss=0.308, pruned_loss=0.07522, over 4265331.58 frames. ], batch size: 263, lr: 5.13e-03, grad_scale: 16.0 2023-06-24 01:38:28,156 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=993738.0, ans=0.0 2023-06-24 01:38:32,029 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=993738.0, ans=0.125 2023-06-24 01:38:46,461 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.08 vs. limit=15.0 2023-06-24 01:38:51,564 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=993798.0, ans=0.1 2023-06-24 01:39:11,501 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.29 vs. limit=22.5 2023-06-24 01:39:23,208 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=993918.0, ans=0.04949747468305833 2023-06-24 01:39:42,995 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=993918.0, ans=0.125 2023-06-24 01:40:09,767 INFO [train.py:996] (3/4) Epoch 6, batch 13200, loss[loss=0.2396, simple_loss=0.3077, pruned_loss=0.08578, over 21271.00 frames. ], tot_loss[loss=0.227, simple_loss=0.3049, pruned_loss=0.07455, over 4267994.76 frames. ], batch size: 549, lr: 5.13e-03, grad_scale: 32.0 2023-06-24 01:40:34,925 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=994098.0, ans=0.0 2023-06-24 01:41:56,114 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.076e+02 2.674e+02 2.987e+02 3.685e+02 5.841e+02, threshold=5.974e+02, percent-clipped=0.0 2023-06-24 01:41:59,679 INFO [train.py:996] (3/4) Epoch 6, batch 13250, loss[loss=0.2256, simple_loss=0.2956, pruned_loss=0.07781, over 21822.00 frames. 
], tot_loss[loss=0.2296, simple_loss=0.305, pruned_loss=0.07705, over 4277365.44 frames. ], batch size: 107, lr: 5.13e-03, grad_scale: 32.0 2023-06-24 01:42:15,899 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=994398.0, ans=0.0 2023-06-24 01:42:19,848 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=994398.0, ans=0.2 2023-06-24 01:42:28,020 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=994398.0, ans=0.2 2023-06-24 01:43:49,654 INFO [train.py:996] (3/4) Epoch 6, batch 13300, loss[loss=0.2202, simple_loss=0.3176, pruned_loss=0.06146, over 20806.00 frames. ], tot_loss[loss=0.231, simple_loss=0.3083, pruned_loss=0.07686, over 4279554.54 frames. ], batch size: 609, lr: 5.13e-03, grad_scale: 32.0 2023-06-24 01:44:33,943 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=994698.0, ans=0.125 2023-06-24 01:45:13,814 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=994818.0, ans=0.2 2023-06-24 01:45:40,722 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=994938.0, ans=0.125 2023-06-24 01:45:41,750 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.934e+02 2.520e+02 2.865e+02 3.222e+02 4.480e+02, threshold=5.730e+02, percent-clipped=0.0 2023-06-24 01:45:41,781 INFO [train.py:996] (3/4) Epoch 6, batch 13350, loss[loss=0.2458, simple_loss=0.3274, pruned_loss=0.08214, over 21733.00 frames. ], tot_loss[loss=0.236, simple_loss=0.3129, pruned_loss=0.07958, over 4280790.89 frames. ], batch size: 298, lr: 5.13e-03, grad_scale: 8.0 2023-06-24 01:47:28,170 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=995178.0, ans=0.2 2023-06-24 01:47:32,581 INFO [train.py:996] (3/4) Epoch 6, batch 13400, loss[loss=0.2504, simple_loss=0.3168, pruned_loss=0.092, over 21304.00 frames. ], tot_loss[loss=0.2385, simple_loss=0.3141, pruned_loss=0.08146, over 4282227.50 frames. ], batch size: 176, lr: 5.13e-03, grad_scale: 8.0 2023-06-24 01:47:33,565 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.91 vs. 
limit=15.0 2023-06-24 01:47:35,079 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=995238.0, ans=0.0 2023-06-24 01:47:36,598 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-24 01:47:39,897 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=995238.0, ans=0.0 2023-06-24 01:47:50,661 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=995298.0, ans=0.0 2023-06-24 01:48:26,853 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=995358.0, ans=0.2 2023-06-24 01:48:44,371 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=995418.0, ans=0.1 2023-06-24 01:48:55,151 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=995418.0, ans=0.0 2023-06-24 01:49:23,328 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.257e+02 2.783e+02 3.072e+02 3.557e+02 5.639e+02, threshold=6.143e+02, percent-clipped=0.0 2023-06-24 01:49:23,360 INFO [train.py:996] (3/4) Epoch 6, batch 13450, loss[loss=0.2228, simple_loss=0.2885, pruned_loss=0.07861, over 21789.00 frames. ], tot_loss[loss=0.2398, simple_loss=0.3151, pruned_loss=0.08223, over 4276043.56 frames. ], batch size: 118, lr: 5.13e-03, grad_scale: 8.0 2023-06-24 01:49:34,543 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=995538.0, ans=0.1 2023-06-24 01:49:45,629 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=995598.0, ans=0.125 2023-06-24 01:51:13,915 INFO [train.py:996] (3/4) Epoch 6, batch 13500, loss[loss=0.2204, simple_loss=0.295, pruned_loss=0.07285, over 21799.00 frames. ], tot_loss[loss=0.2322, simple_loss=0.3051, pruned_loss=0.07961, over 4272825.47 frames. ], batch size: 352, lr: 5.13e-03, grad_scale: 8.0 2023-06-24 01:51:32,839 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer_ff3.min_abs, batch_count=995898.0, ans=0.2 2023-06-24 01:52:40,779 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=6.07 vs. limit=10.0 2023-06-24 01:53:06,787 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.929e+02 2.607e+02 3.013e+02 3.630e+02 7.011e+02, threshold=6.026e+02, percent-clipped=1.0 2023-06-24 01:53:06,818 INFO [train.py:996] (3/4) Epoch 6, batch 13550, loss[loss=0.2728, simple_loss=0.3829, pruned_loss=0.08132, over 21645.00 frames. ], tot_loss[loss=0.2339, simple_loss=0.3096, pruned_loss=0.07916, over 4278345.82 frames. 
], batch size: 441, lr: 5.12e-03, grad_scale: 8.0 2023-06-24 01:53:49,232 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=996198.0, ans=0.125 2023-06-24 01:53:50,740 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=996198.0, ans=0.0 2023-06-24 01:53:57,702 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=996258.0, ans=0.125 2023-06-24 01:54:13,422 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=996258.0, ans=0.04949747468305833 2023-06-24 01:54:13,540 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=996258.0, ans=0.125 2023-06-24 01:54:32,387 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=996378.0, ans=0.125 2023-06-24 01:54:48,687 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=996378.0, ans=0.0 2023-06-24 01:54:57,329 INFO [train.py:996] (3/4) Epoch 6, batch 13600, loss[loss=0.2205, simple_loss=0.284, pruned_loss=0.07852, over 21797.00 frames. ], tot_loss[loss=0.2355, simple_loss=0.3114, pruned_loss=0.07982, over 4283024.93 frames. ], batch size: 247, lr: 5.12e-03, grad_scale: 16.0 2023-06-24 01:55:01,222 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=996438.0, ans=0.1 2023-06-24 01:55:11,700 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=996438.0, ans=0.125 2023-06-24 01:55:44,944 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.16 vs. limit=15.0 2023-06-24 01:56:19,475 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=996618.0, ans=0.0 2023-06-24 01:56:42,586 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=996678.0, ans=0.125 2023-06-24 01:56:47,193 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.983e+02 2.489e+02 2.780e+02 3.135e+02 6.333e+02, threshold=5.560e+02, percent-clipped=1.0 2023-06-24 01:56:47,223 INFO [train.py:996] (3/4) Epoch 6, batch 13650, loss[loss=0.1935, simple_loss=0.243, pruned_loss=0.07202, over 19997.00 frames. ], tot_loss[loss=0.2298, simple_loss=0.3063, pruned_loss=0.07671, over 4279663.09 frames. ], batch size: 703, lr: 5.12e-03, grad_scale: 16.0 2023-06-24 01:56:55,949 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=996738.0, ans=0.125 2023-06-24 01:57:11,912 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.38 vs. 
limit=22.5 2023-06-24 01:57:12,939 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=996738.0, ans=0.2 2023-06-24 01:57:30,491 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=996798.0, ans=0.09899494936611666 2023-06-24 01:58:29,400 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=996978.0, ans=0.125 2023-06-24 01:58:37,426 INFO [train.py:996] (3/4) Epoch 6, batch 13700, loss[loss=0.1983, simple_loss=0.2644, pruned_loss=0.06609, over 21466.00 frames. ], tot_loss[loss=0.2262, simple_loss=0.2999, pruned_loss=0.07629, over 4280593.74 frames. ], batch size: 211, lr: 5.12e-03, grad_scale: 16.0 2023-06-24 01:59:00,399 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=997038.0, ans=0.125 2023-06-24 01:59:06,239 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=997098.0, ans=0.125 2023-06-24 01:59:13,359 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=997098.0, ans=0.125 2023-06-24 01:59:14,917 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=997098.0, ans=0.0 2023-06-24 01:59:18,456 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=997098.0, ans=0.0 2023-06-24 01:59:49,208 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=997218.0, ans=0.1 2023-06-24 02:00:05,592 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=997218.0, ans=0.125 2023-06-24 02:00:41,406 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.183e+02 2.702e+02 3.112e+02 3.506e+02 5.710e+02, threshold=6.223e+02, percent-clipped=1.0 2023-06-24 02:00:41,440 INFO [train.py:996] (3/4) Epoch 6, batch 13750, loss[loss=0.1959, simple_loss=0.2603, pruned_loss=0.06579, over 21275.00 frames. ], tot_loss[loss=0.2256, simple_loss=0.2992, pruned_loss=0.07599, over 4284698.34 frames. ], batch size: 176, lr: 5.12e-03, grad_scale: 16.0 2023-06-24 02:00:45,881 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=997338.0, ans=0.2 2023-06-24 02:00:49,600 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=997338.0, ans=0.125 2023-06-24 02:01:04,653 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=997398.0, ans=0.125 2023-06-24 02:01:51,624 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=997518.0, ans=0.0 2023-06-24 02:02:21,254 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.23 vs. limit=15.0 2023-06-24 02:02:30,755 INFO [train.py:996] (3/4) Epoch 6, batch 13800, loss[loss=0.2505, simple_loss=0.3455, pruned_loss=0.07773, over 21767.00 frames. 
], tot_loss[loss=0.2269, simple_loss=0.3038, pruned_loss=0.07501, over 4278329.50 frames. ], batch size: 282, lr: 5.12e-03, grad_scale: 16.0 2023-06-24 02:02:35,036 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=997638.0, ans=0.125 2023-06-24 02:03:22,357 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.56 vs. limit=15.0 2023-06-24 02:03:23,703 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=997758.0, ans=0.2 2023-06-24 02:03:56,670 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=997818.0, ans=0.0 2023-06-24 02:04:18,135 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=997878.0, ans=0.125 2023-06-24 02:04:18,144 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=997878.0, ans=0.95 2023-06-24 02:04:22,873 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.067e+02 2.948e+02 3.505e+02 4.086e+02 7.226e+02, threshold=7.009e+02, percent-clipped=3.0 2023-06-24 02:04:22,917 INFO [train.py:996] (3/4) Epoch 6, batch 13850, loss[loss=0.2817, simple_loss=0.3603, pruned_loss=0.1016, over 21715.00 frames. ], tot_loss[loss=0.2323, simple_loss=0.3117, pruned_loss=0.0765, over 4286364.93 frames. ], batch size: 441, lr: 5.12e-03, grad_scale: 16.0 2023-06-24 02:05:25,253 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=998058.0, ans=0.0 2023-06-24 02:05:50,838 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=998118.0, ans=0.2 2023-06-24 02:06:06,158 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=998178.0, ans=0.125 2023-06-24 02:06:09,448 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=998178.0, ans=0.125 2023-06-24 02:06:17,490 INFO [train.py:996] (3/4) Epoch 6, batch 13900, loss[loss=0.2625, simple_loss=0.3229, pruned_loss=0.1011, over 19943.00 frames. ], tot_loss[loss=0.2371, simple_loss=0.3147, pruned_loss=0.07981, over 4284526.23 frames. 
], batch size: 702, lr: 5.12e-03, grad_scale: 16.0 2023-06-24 02:06:22,075 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=998238.0, ans=0.125 2023-06-24 02:06:23,854 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=998238.0, ans=0.0 2023-06-24 02:06:28,993 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=998238.0, ans=0.125 2023-06-24 02:06:54,676 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=998298.0, ans=0.0 2023-06-24 02:08:08,420 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.130e+02 2.809e+02 3.184e+02 3.702e+02 5.147e+02, threshold=6.368e+02, percent-clipped=0.0 2023-06-24 02:08:08,450 INFO [train.py:996] (3/4) Epoch 6, batch 13950, loss[loss=0.2367, simple_loss=0.3042, pruned_loss=0.08459, over 21432.00 frames. ], tot_loss[loss=0.2387, simple_loss=0.3153, pruned_loss=0.08105, over 4285751.18 frames. ], batch size: 211, lr: 5.12e-03, grad_scale: 16.0 2023-06-24 02:08:15,861 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=998538.0, ans=0.0 2023-06-24 02:08:27,792 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=998538.0, ans=0.125 2023-06-24 02:08:41,687 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=998598.0, ans=0.0 2023-06-24 02:08:53,394 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=998658.0, ans=0.1 2023-06-24 02:09:23,203 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=998718.0, ans=0.125 2023-06-24 02:09:42,777 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=14.00 vs. limit=22.5 2023-06-24 02:09:57,213 INFO [train.py:996] (3/4) Epoch 6, batch 14000, loss[loss=0.2086, simple_loss=0.2986, pruned_loss=0.05933, over 21698.00 frames. ], tot_loss[loss=0.2332, simple_loss=0.3102, pruned_loss=0.0781, over 4281440.29 frames. ], batch size: 389, lr: 5.12e-03, grad_scale: 32.0 2023-06-24 02:10:27,135 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=6.85 vs. limit=15.0 2023-06-24 02:11:06,763 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=999018.0, ans=0.125 2023-06-24 02:11:46,160 INFO [train.py:996] (3/4) Epoch 6, batch 14050, loss[loss=0.2007, simple_loss=0.266, pruned_loss=0.06766, over 21239.00 frames. ], tot_loss[loss=0.227, simple_loss=0.3054, pruned_loss=0.07429, over 4277676.40 frames. ], batch size: 548, lr: 5.12e-03, grad_scale: 16.0 2023-06-24 02:11:47,705 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.602e+02 2.313e+02 2.760e+02 3.193e+02 4.998e+02, threshold=5.521e+02, percent-clipped=0.0 2023-06-24 02:13:04,584 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.15 vs. 
limit=22.5 2023-06-24 02:13:33,976 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=999438.0, ans=0.1 2023-06-24 02:13:35,267 INFO [train.py:996] (3/4) Epoch 6, batch 14100, loss[loss=0.2025, simple_loss=0.2725, pruned_loss=0.06624, over 21721.00 frames. ], tot_loss[loss=0.2242, simple_loss=0.2999, pruned_loss=0.07425, over 4274029.00 frames. ], batch size: 247, lr: 5.12e-03, grad_scale: 16.0 2023-06-24 02:15:15,652 INFO [train.py:996] (3/4) Epoch 6, batch 14150, loss[loss=0.2391, simple_loss=0.3278, pruned_loss=0.07517, over 21893.00 frames. ], tot_loss[loss=0.228, simple_loss=0.3047, pruned_loss=0.07571, over 4281844.00 frames. ], batch size: 98, lr: 5.12e-03, grad_scale: 16.0 2023-06-24 02:15:17,246 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.953e+02 2.422e+02 2.767e+02 3.253e+02 5.449e+02, threshold=5.534e+02, percent-clipped=0.0 2023-06-24 02:15:38,112 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=999798.0, ans=0.0 2023-06-24 02:16:40,551 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=999918.0, ans=0.2 2023-06-24 02:16:57,849 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=999978.0, ans=0.2 2023-06-24 02:17:02,004 INFO [train.py:996] (3/4) Epoch 6, batch 14200, loss[loss=0.248, simple_loss=0.3151, pruned_loss=0.09048, over 21720.00 frames. ], tot_loss[loss=0.2254, simple_loss=0.3024, pruned_loss=0.07423, over 4273371.56 frames. ], batch size: 441, lr: 5.11e-03, grad_scale: 16.0 2023-06-24 02:18:24,712 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1000218.0, ans=0.125 2023-06-24 02:18:52,104 INFO [train.py:996] (3/4) Epoch 6, batch 14250, loss[loss=0.2298, simple_loss=0.29, pruned_loss=0.0848, over 20099.00 frames. ], tot_loss[loss=0.2224, simple_loss=0.2967, pruned_loss=0.07407, over 4265105.18 frames. ], batch size: 703, lr: 5.11e-03, grad_scale: 16.0 2023-06-24 02:18:53,604 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.840e+02 2.255e+02 2.600e+02 3.105e+02 6.584e+02, threshold=5.199e+02, percent-clipped=1.0 2023-06-24 02:19:11,765 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=1000338.0, ans=0.2 2023-06-24 02:19:17,466 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1000398.0, ans=0.125 2023-06-24 02:19:41,501 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1000458.0, ans=0.0 2023-06-24 02:20:44,852 INFO [train.py:996] (3/4) Epoch 6, batch 14300, loss[loss=0.3112, simple_loss=0.403, pruned_loss=0.1097, over 21651.00 frames. ], tot_loss[loss=0.2229, simple_loss=0.298, pruned_loss=0.07392, over 4243756.05 frames. 
], batch size: 389, lr: 5.11e-03, grad_scale: 16.0 2023-06-24 02:21:38,334 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=1000758.0, ans=0.0 2023-06-24 02:21:58,210 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1000818.0, ans=0.125 2023-06-24 02:22:01,391 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1000818.0, ans=0.1 2023-06-24 02:22:01,443 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=1000818.0, ans=0.04949747468305833 2023-06-24 02:22:31,086 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1000878.0, ans=0.125 2023-06-24 02:22:34,133 INFO [train.py:996] (3/4) Epoch 6, batch 14350, loss[loss=0.2292, simple_loss=0.3223, pruned_loss=0.06802, over 21839.00 frames. ], tot_loss[loss=0.2254, simple_loss=0.3028, pruned_loss=0.07404, over 4255737.35 frames. ], batch size: 316, lr: 5.11e-03, grad_scale: 16.0 2023-06-24 02:22:34,766 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=1000938.0, ans=0.2 2023-06-24 02:22:36,004 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.695e+02 2.573e+02 3.287e+02 4.161e+02 6.824e+02, threshold=6.573e+02, percent-clipped=7.0 2023-06-24 02:22:47,028 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1000938.0, ans=0.125 2023-06-24 02:22:47,460 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.08 vs. limit=10.0 2023-06-24 02:23:58,830 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=1001118.0, ans=0.0 2023-06-24 02:24:09,429 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1001178.0, ans=0.125 2023-06-24 02:24:25,757 INFO [train.py:996] (3/4) Epoch 6, batch 14400, loss[loss=0.2143, simple_loss=0.2896, pruned_loss=0.06951, over 20955.00 frames. ], tot_loss[loss=0.226, simple_loss=0.3028, pruned_loss=0.07463, over 4258502.96 frames. ], batch size: 608, lr: 5.11e-03, grad_scale: 32.0 2023-06-24 02:25:04,705 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1001358.0, ans=0.1 2023-06-24 02:25:42,613 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.60 vs. limit=22.5 2023-06-24 02:25:45,376 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1001418.0, ans=0.125 2023-06-24 02:26:04,886 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1001478.0, ans=0.125 2023-06-24 02:26:09,489 INFO [train.py:996] (3/4) Epoch 6, batch 14450, loss[loss=0.2004, simple_loss=0.2689, pruned_loss=0.06598, over 21597.00 frames. ], tot_loss[loss=0.2232, simple_loss=0.2973, pruned_loss=0.07453, over 4266100.66 frames. 
], batch size: 263, lr: 5.11e-03, grad_scale: 32.0 2023-06-24 02:26:16,057 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.121e+02 2.443e+02 2.785e+02 3.113e+02 5.962e+02, threshold=5.570e+02, percent-clipped=0.0 2023-06-24 02:26:22,202 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=1001538.0, ans=0.0 2023-06-24 02:27:56,671 INFO [train.py:996] (3/4) Epoch 6, batch 14500, loss[loss=0.1934, simple_loss=0.2731, pruned_loss=0.05686, over 21756.00 frames. ], tot_loss[loss=0.2207, simple_loss=0.2936, pruned_loss=0.07392, over 4265961.39 frames. ], batch size: 112, lr: 5.11e-03, grad_scale: 32.0 2023-06-24 02:28:32,476 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1001898.0, ans=0.0 2023-06-24 02:28:42,721 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=1001958.0, ans=0.2 2023-06-24 02:29:00,712 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1001958.0, ans=0.125 2023-06-24 02:29:09,683 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1002018.0, ans=0.0 2023-06-24 02:29:52,398 INFO [train.py:996] (3/4) Epoch 6, batch 14550, loss[loss=0.2606, simple_loss=0.3335, pruned_loss=0.09382, over 21687.00 frames. ], tot_loss[loss=0.2263, simple_loss=0.3001, pruned_loss=0.07622, over 4260228.13 frames. ], batch size: 298, lr: 5.11e-03, grad_scale: 32.0 2023-06-24 02:30:01,985 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.903e+02 2.448e+02 2.869e+02 3.616e+02 7.079e+02, threshold=5.738e+02, percent-clipped=4.0 2023-06-24 02:30:11,360 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1002138.0, ans=0.125 2023-06-24 02:31:16,364 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1002318.0, ans=0.125 2023-06-24 02:31:46,402 INFO [train.py:996] (3/4) Epoch 6, batch 14600, loss[loss=0.2572, simple_loss=0.3434, pruned_loss=0.08547, over 21530.00 frames. ], tot_loss[loss=0.2348, simple_loss=0.3087, pruned_loss=0.08042, over 4267568.18 frames. ], batch size: 131, lr: 5.11e-03, grad_scale: 16.0 2023-06-24 02:31:59,015 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1002438.0, ans=0.125 2023-06-24 02:32:00,693 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=1002438.0, ans=0.2 2023-06-24 02:32:23,707 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.13 vs. limit=15.0 2023-06-24 02:32:28,468 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=1002558.0, ans=0.0 2023-06-24 02:33:25,401 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1002678.0, ans=0.0 2023-06-24 02:33:28,146 INFO [train.py:996] (3/4) Epoch 6, batch 14650, loss[loss=0.2273, simple_loss=0.3114, pruned_loss=0.0716, over 21697.00 frames. ], tot_loss[loss=0.2348, simple_loss=0.3111, pruned_loss=0.07932, over 4263623.18 frames. 
], batch size: 230, lr: 5.11e-03, grad_scale: 16.0 2023-06-24 02:33:28,528 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1002738.0, ans=0.0 2023-06-24 02:33:31,374 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.829e+02 2.911e+02 3.568e+02 4.716e+02 7.092e+02, threshold=7.135e+02, percent-clipped=11.0 2023-06-24 02:33:40,879 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=1002738.0, ans=0.2 2023-06-24 02:33:46,000 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-24 02:34:18,935 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1002858.0, ans=0.125 2023-06-24 02:34:20,304 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=1002858.0, ans=0.0 2023-06-24 02:34:20,831 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.76 vs. limit=15.0 2023-06-24 02:35:03,034 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten.whitening_limit, batch_count=1002978.0, ans=15.0 2023-06-24 02:35:09,406 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1002978.0, ans=0.125 2023-06-24 02:35:14,268 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.min_positive, batch_count=1003038.0, ans=0.025 2023-06-24 02:35:15,491 INFO [train.py:996] (3/4) Epoch 6, batch 14700, loss[loss=0.1855, simple_loss=0.2726, pruned_loss=0.04924, over 21273.00 frames. ], tot_loss[loss=0.2261, simple_loss=0.3044, pruned_loss=0.07392, over 4264008.14 frames. ], batch size: 159, lr: 5.11e-03, grad_scale: 16.0 2023-06-24 02:36:07,467 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1003158.0, ans=0.125 2023-06-24 02:36:18,372 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=1003218.0, ans=0.2 2023-06-24 02:36:19,920 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1003218.0, ans=0.125 2023-06-24 02:37:05,511 INFO [train.py:996] (3/4) Epoch 6, batch 14750, loss[loss=0.2652, simple_loss=0.338, pruned_loss=0.09624, over 21510.00 frames. ], tot_loss[loss=0.2299, simple_loss=0.3077, pruned_loss=0.07606, over 4264993.87 frames. ], batch size: 194, lr: 5.11e-03, grad_scale: 16.0 2023-06-24 02:37:08,872 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.655e+02 2.584e+02 3.183e+02 3.769e+02 5.952e+02, threshold=6.365e+02, percent-clipped=0.0 2023-06-24 02:37:22,531 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.44 vs. 
limit=15.0 2023-06-24 02:38:16,544 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1003518.0, ans=0.125 2023-06-24 02:38:40,769 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1003578.0, ans=0.125 2023-06-24 02:38:59,556 INFO [train.py:996] (3/4) Epoch 6, batch 14800, loss[loss=0.2378, simple_loss=0.3019, pruned_loss=0.08682, over 21804.00 frames. ], tot_loss[loss=0.2408, simple_loss=0.3185, pruned_loss=0.08158, over 4267672.72 frames. ], batch size: 107, lr: 5.11e-03, grad_scale: 32.0 2023-06-24 02:39:21,272 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1003698.0, ans=0.0 2023-06-24 02:39:30,177 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=1003698.0, ans=0.125 2023-06-24 02:39:40,097 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.47 vs. limit=6.0 2023-06-24 02:39:55,331 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=1003758.0, ans=0.0 2023-06-24 02:40:55,586 INFO [train.py:996] (3/4) Epoch 6, batch 14850, loss[loss=0.2101, simple_loss=0.2802, pruned_loss=0.06997, over 21704.00 frames. ], tot_loss[loss=0.2369, simple_loss=0.3115, pruned_loss=0.08114, over 4264232.50 frames. ], batch size: 298, lr: 5.10e-03, grad_scale: 32.0 2023-06-24 02:40:59,018 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.106e+02 2.678e+02 3.116e+02 4.005e+02 6.901e+02, threshold=6.233e+02, percent-clipped=1.0 2023-06-24 02:41:01,900 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=1003938.0, ans=0.125 2023-06-24 02:41:02,042 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1003938.0, ans=0.125 2023-06-24 02:41:09,672 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.33 vs. limit=15.0 2023-06-24 02:41:59,380 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=1004118.0, ans=0.2 2023-06-24 02:42:00,798 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-24 02:42:14,181 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.36 vs. limit=15.0 2023-06-24 02:42:44,055 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=1004178.0, ans=0.04949747468305833 2023-06-24 02:42:47,053 INFO [train.py:996] (3/4) Epoch 6, batch 14900, loss[loss=0.2758, simple_loss=0.3425, pruned_loss=0.1046, over 21805.00 frames. ], tot_loss[loss=0.2399, simple_loss=0.3149, pruned_loss=0.08244, over 4266809.60 frames. 
], batch size: 441, lr: 5.10e-03, grad_scale: 32.0 2023-06-24 02:42:56,764 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1004238.0, ans=0.1 2023-06-24 02:43:15,905 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1004298.0, ans=0.0 2023-06-24 02:43:36,797 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=1004358.0, ans=0.0 2023-06-24 02:44:35,538 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1004538.0, ans=0.125 2023-06-24 02:44:36,518 INFO [train.py:996] (3/4) Epoch 6, batch 14950, loss[loss=0.243, simple_loss=0.3245, pruned_loss=0.08076, over 21278.00 frames. ], tot_loss[loss=0.2391, simple_loss=0.3153, pruned_loss=0.08142, over 4265327.53 frames. ], batch size: 159, lr: 5.10e-03, grad_scale: 32.0 2023-06-24 02:44:39,948 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.887e+02 2.635e+02 3.010e+02 3.574e+02 5.643e+02, threshold=6.019e+02, percent-clipped=0.0 2023-06-24 02:44:55,264 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.88 vs. limit=15.0 2023-06-24 02:46:04,174 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1004718.0, ans=0.0 2023-06-24 02:46:20,719 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1004778.0, ans=0.0 2023-06-24 02:46:24,989 INFO [train.py:996] (3/4) Epoch 6, batch 15000, loss[loss=0.235, simple_loss=0.3076, pruned_loss=0.08126, over 21825.00 frames. ], tot_loss[loss=0.2428, simple_loss=0.3186, pruned_loss=0.08352, over 4268541.17 frames. ], batch size: 332, lr: 5.10e-03, grad_scale: 32.0 2023-06-24 02:46:24,990 INFO [train.py:1019] (3/4) Computing validation loss 2023-06-24 02:46:38,398 INFO [zipformer.py:1728] (3/4) name=encoder.encoders.2.encoder.layers.1.self_attn_weights, attn_weights_entropy = tensor([1.6773, 3.7474, 3.5442, 3.7614], device='cuda:3') 2023-06-24 02:46:45,299 INFO [train.py:1028] (3/4) Epoch 6, validation: loss=0.2621, simple_loss=0.3511, pruned_loss=0.08652, over 1796401.00 frames. 2023-06-24 02:46:45,299 INFO [train.py:1029] (3/4) Maximum memory allocated so far is 23336MB 2023-06-24 02:46:45,936 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1004838.0, ans=0.0 2023-06-24 02:47:26,291 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=1004898.0, ans=0.0 2023-06-24 02:48:36,411 INFO [train.py:996] (3/4) Epoch 6, batch 15050, loss[loss=0.2437, simple_loss=0.3412, pruned_loss=0.07306, over 20744.00 frames. ], tot_loss[loss=0.2433, simple_loss=0.3187, pruned_loss=0.08394, over 4260339.91 frames. ], batch size: 607, lr: 5.10e-03, grad_scale: 32.0 2023-06-24 02:48:45,229 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.077e+02 2.748e+02 3.194e+02 3.808e+02 5.890e+02, threshold=6.387e+02, percent-clipped=0.0 2023-06-24 02:50:31,390 INFO [train.py:996] (3/4) Epoch 6, batch 15100, loss[loss=0.2462, simple_loss=0.3224, pruned_loss=0.08495, over 21429.00 frames. ], tot_loss[loss=0.2434, simple_loss=0.3199, pruned_loss=0.08348, over 4258429.45 frames. 
], batch size: 131, lr: 5.10e-03, grad_scale: 32.0 2023-06-24 02:51:14,404 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1005498.0, ans=0.125 2023-06-24 02:51:47,921 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=1005618.0, ans=0.0 2023-06-24 02:51:50,293 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.12 vs. limit=6.0 2023-06-24 02:52:20,509 INFO [train.py:996] (3/4) Epoch 6, batch 15150, loss[loss=0.2059, simple_loss=0.2617, pruned_loss=0.07503, over 21218.00 frames. ], tot_loss[loss=0.2424, simple_loss=0.3168, pruned_loss=0.08395, over 4261810.41 frames. ], batch size: 549, lr: 5.10e-03, grad_scale: 32.0 2023-06-24 02:52:21,579 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.36 vs. limit=15.0 2023-06-24 02:52:29,945 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.002e+02 2.489e+02 2.718e+02 3.127e+02 6.231e+02, threshold=5.435e+02, percent-clipped=0.0 2023-06-24 02:52:58,424 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1005798.0, ans=0.0 2023-06-24 02:52:58,952 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.51 vs. limit=15.0 2023-06-24 02:53:10,002 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=1005858.0, ans=0.125 2023-06-24 02:54:14,596 INFO [train.py:996] (3/4) Epoch 6, batch 15200, loss[loss=0.156, simple_loss=0.2111, pruned_loss=0.05042, over 15710.00 frames. ], tot_loss[loss=0.2344, simple_loss=0.3085, pruned_loss=0.08017, over 4258469.69 frames. ], batch size: 60, lr: 5.10e-03, grad_scale: 32.0 2023-06-24 02:55:30,585 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=9.09 vs. limit=22.5 2023-06-24 02:56:03,338 INFO [train.py:996] (3/4) Epoch 6, batch 15250, loss[loss=0.2062, simple_loss=0.2701, pruned_loss=0.07117, over 21596.00 frames. ], tot_loss[loss=0.2301, simple_loss=0.3032, pruned_loss=0.07852, over 4258747.94 frames. ], batch size: 263, lr: 5.10e-03, grad_scale: 16.0 2023-06-24 02:56:13,792 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.617e+02 2.536e+02 2.850e+02 3.419e+02 5.207e+02, threshold=5.701e+02, percent-clipped=0.0 2023-06-24 02:57:20,746 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten.whitening_limit, batch_count=1006518.0, ans=22.5 2023-06-24 02:57:22,580 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=1006518.0, ans=0.2 2023-06-24 02:57:45,186 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1006578.0, ans=0.125 2023-06-24 02:57:58,596 INFO [train.py:996] (3/4) Epoch 6, batch 15300, loss[loss=0.2608, simple_loss=0.3422, pruned_loss=0.08972, over 21809.00 frames. ], tot_loss[loss=0.2348, simple_loss=0.3065, pruned_loss=0.08156, over 4254541.72 frames. 
], batch size: 118, lr: 5.10e-03, grad_scale: 16.0 2023-06-24 02:58:14,882 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1006698.0, ans=0.1 2023-06-24 02:58:32,897 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=1006698.0, ans=0.0 2023-06-24 02:59:24,061 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1006878.0, ans=0.125 2023-06-24 02:59:24,759 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=10.72 vs. limit=22.5 2023-06-24 02:59:43,547 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=1006878.0, ans=0.0 2023-06-24 02:59:48,119 INFO [train.py:996] (3/4) Epoch 6, batch 15350, loss[loss=0.2385, simple_loss=0.3342, pruned_loss=0.07141, over 21672.00 frames. ], tot_loss[loss=0.2383, simple_loss=0.3105, pruned_loss=0.08311, over 4256574.52 frames. ], batch size: 414, lr: 5.10e-03, grad_scale: 16.0 2023-06-24 02:59:52,935 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.904e+02 2.681e+02 3.062e+02 3.788e+02 5.909e+02, threshold=6.124e+02, percent-clipped=1.0 2023-06-24 03:00:28,802 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1007058.0, ans=0.125 2023-06-24 03:01:23,777 INFO [train.py:996] (3/4) Epoch 6, batch 15400, loss[loss=0.2151, simple_loss=0.2835, pruned_loss=0.07335, over 21300.00 frames. ], tot_loss[loss=0.2363, simple_loss=0.3105, pruned_loss=0.08102, over 4253166.71 frames. ], batch size: 143, lr: 5.10e-03, grad_scale: 16.0 2023-06-24 03:01:36,651 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1007238.0, ans=0.0 2023-06-24 03:02:13,409 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.72 vs. limit=10.0 2023-06-24 03:02:55,127 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1007478.0, ans=0.125 2023-06-24 03:03:12,814 INFO [train.py:996] (3/4) Epoch 6, batch 15450, loss[loss=0.214, simple_loss=0.2737, pruned_loss=0.07713, over 21623.00 frames. ], tot_loss[loss=0.2338, simple_loss=0.3079, pruned_loss=0.07987, over 4252255.18 frames. ], batch size: 548, lr: 5.10e-03, grad_scale: 16.0 2023-06-24 03:03:23,431 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.018e+02 2.379e+02 2.689e+02 3.180e+02 6.204e+02, threshold=5.379e+02, percent-clipped=1.0 2023-06-24 03:03:29,673 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.81 vs. 
limit=22.5 2023-06-24 03:04:05,698 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=1007658.0, ans=0.125 2023-06-24 03:04:34,138 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=1007718.0, ans=0.125 2023-06-24 03:04:55,710 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1007778.0, ans=0.0 2023-06-24 03:05:07,242 INFO [train.py:996] (3/4) Epoch 6, batch 15500, loss[loss=0.2667, simple_loss=0.3369, pruned_loss=0.09831, over 21432.00 frames. ], tot_loss[loss=0.2348, simple_loss=0.3107, pruned_loss=0.07952, over 4250237.15 frames. ], batch size: 211, lr: 5.09e-03, grad_scale: 16.0 2023-06-24 03:05:22,130 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=1007838.0, ans=0.125 2023-06-24 03:05:48,750 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.25 vs. limit=22.5 2023-06-24 03:06:24,084 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1008018.0, ans=0.125 2023-06-24 03:06:58,615 INFO [train.py:996] (3/4) Epoch 6, batch 15550, loss[loss=0.22, simple_loss=0.3127, pruned_loss=0.06367, over 21633.00 frames. ], tot_loss[loss=0.2322, simple_loss=0.3094, pruned_loss=0.07756, over 4259755.71 frames. ], batch size: 389, lr: 5.09e-03, grad_scale: 16.0 2023-06-24 03:07:03,916 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.019e+02 2.505e+02 2.792e+02 3.296e+02 4.983e+02, threshold=5.584e+02, percent-clipped=0.0 2023-06-24 03:07:25,289 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.46 vs. limit=15.0 2023-06-24 03:08:46,196 INFO [train.py:996] (3/4) Epoch 6, batch 15600, loss[loss=0.1927, simple_loss=0.2629, pruned_loss=0.06128, over 21595.00 frames. ], tot_loss[loss=0.2274, simple_loss=0.3034, pruned_loss=0.07574, over 4264097.41 frames. ], batch size: 247, lr: 5.09e-03, grad_scale: 32.0 2023-06-24 03:08:52,137 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1008438.0, ans=0.1 2023-06-24 03:09:46,143 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.42 vs. limit=15.0 2023-06-24 03:09:49,178 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=1008618.0, ans=0.2 2023-06-24 03:10:10,143 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1008678.0, ans=0.125 2023-06-24 03:10:33,918 INFO [train.py:996] (3/4) Epoch 6, batch 15650, loss[loss=0.2127, simple_loss=0.2788, pruned_loss=0.07325, over 21599.00 frames. ], tot_loss[loss=0.2257, simple_loss=0.3014, pruned_loss=0.075, over 4271476.58 frames. 
], batch size: 332, lr: 5.09e-03, grad_scale: 32.0 2023-06-24 03:10:37,853 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1008738.0, ans=0.0 2023-06-24 03:10:39,278 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.960e+02 2.465e+02 2.724e+02 3.048e+02 4.286e+02, threshold=5.447e+02, percent-clipped=0.0 2023-06-24 03:10:53,886 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=1008738.0, ans=0.04949747468305833 2023-06-24 03:11:22,277 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1008858.0, ans=0.125 2023-06-24 03:11:56,985 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.96 vs. limit=6.0 2023-06-24 03:12:21,528 INFO [train.py:996] (3/4) Epoch 6, batch 15700, loss[loss=0.2018, simple_loss=0.2874, pruned_loss=0.05812, over 21623.00 frames. ], tot_loss[loss=0.2223, simple_loss=0.2971, pruned_loss=0.07372, over 4267581.63 frames. ], batch size: 247, lr: 5.09e-03, grad_scale: 32.0 2023-06-24 03:12:47,723 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1009098.0, ans=0.125 2023-06-24 03:12:58,410 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1009098.0, ans=0.0 2023-06-24 03:13:14,638 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.06 vs. limit=10.0 2023-06-24 03:13:26,396 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=1009218.0, ans=0.0 2023-06-24 03:14:08,907 INFO [train.py:996] (3/4) Epoch 6, batch 15750, loss[loss=0.2158, simple_loss=0.2854, pruned_loss=0.07309, over 21732.00 frames. ], tot_loss[loss=0.2198, simple_loss=0.2929, pruned_loss=0.07335, over 4276278.25 frames. ], batch size: 351, lr: 5.09e-03, grad_scale: 32.0 2023-06-24 03:14:14,108 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.808e+02 2.454e+02 2.677e+02 3.133e+02 4.467e+02, threshold=5.354e+02, percent-clipped=0.0 2023-06-24 03:14:17,098 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.49 vs. limit=15.0 2023-06-24 03:15:57,583 INFO [train.py:996] (3/4) Epoch 6, batch 15800, loss[loss=0.23, simple_loss=0.2709, pruned_loss=0.09455, over 21320.00 frames. ], tot_loss[loss=0.2175, simple_loss=0.2884, pruned_loss=0.07327, over 4258199.52 frames. 
], batch size: 507, lr: 5.09e-03, grad_scale: 32.0 2023-06-24 03:16:02,030 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=1009638.0, ans=0.0 2023-06-24 03:16:08,894 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=1009638.0, ans=0.125 2023-06-24 03:16:26,228 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1009698.0, ans=0.125 2023-06-24 03:16:33,081 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-24 03:16:56,003 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=1009758.0, ans=0.07 2023-06-24 03:17:00,957 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=1009818.0, ans=0.0 2023-06-24 03:17:01,069 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=1009818.0, ans=0.125 2023-06-24 03:17:45,425 INFO [train.py:996] (3/4) Epoch 6, batch 15850, loss[loss=0.2184, simple_loss=0.2833, pruned_loss=0.07675, over 21249.00 frames. ], tot_loss[loss=0.2218, simple_loss=0.2915, pruned_loss=0.07607, over 4269437.67 frames. ], batch size: 176, lr: 5.09e-03, grad_scale: 32.0 2023-06-24 03:17:50,484 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.171e+02 2.697e+02 2.988e+02 3.672e+02 5.659e+02, threshold=5.976e+02, percent-clipped=2.0 2023-06-24 03:18:01,087 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=1009998.0, ans=0.2 2023-06-24 03:18:17,954 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1009998.0, ans=0.0 2023-06-24 03:19:02,186 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=1010118.0, ans=0.0 2023-06-24 03:19:32,096 INFO [train.py:996] (3/4) Epoch 6, batch 15900, loss[loss=0.2261, simple_loss=0.3095, pruned_loss=0.07133, over 21520.00 frames. ], tot_loss[loss=0.2198, simple_loss=0.2887, pruned_loss=0.07546, over 4255137.04 frames. ], batch size: 389, lr: 5.09e-03, grad_scale: 32.0 2023-06-24 03:19:40,530 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=9.67 vs. limit=22.5 2023-06-24 03:19:41,972 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=1010238.0, ans=0.0 2023-06-24 03:20:20,048 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1010358.0, ans=0.125 2023-06-24 03:20:51,240 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1010418.0, ans=0.125 2023-06-24 03:20:54,636 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1010418.0, ans=0.1 2023-06-24 03:21:19,073 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.01 vs. 
limit=15.0 2023-06-24 03:21:19,574 INFO [train.py:996] (3/4) Epoch 6, batch 15950, loss[loss=0.1686, simple_loss=0.2592, pruned_loss=0.03894, over 21524.00 frames. ], tot_loss[loss=0.2178, simple_loss=0.2899, pruned_loss=0.07284, over 4246223.70 frames. ], batch size: 211, lr: 5.09e-03, grad_scale: 32.0 2023-06-24 03:21:20,078 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1010538.0, ans=0.1 2023-06-24 03:21:24,428 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.544e+02 2.251e+02 2.569e+02 3.023e+02 4.641e+02, threshold=5.138e+02, percent-clipped=0.0 2023-06-24 03:21:28,648 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=1010538.0, ans=0.2 2023-06-24 03:21:49,922 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1010598.0, ans=0.125 2023-06-24 03:22:10,837 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=1010658.0, ans=0.0 2023-06-24 03:22:46,818 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1010778.0, ans=0.0 2023-06-24 03:23:00,397 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1010778.0, ans=0.125 2023-06-24 03:23:07,216 INFO [train.py:996] (3/4) Epoch 6, batch 16000, loss[loss=0.2323, simple_loss=0.3094, pruned_loss=0.07758, over 20656.00 frames. ], tot_loss[loss=0.2171, simple_loss=0.292, pruned_loss=0.07106, over 4259012.45 frames. ], batch size: 607, lr: 5.09e-03, grad_scale: 32.0 2023-06-24 03:23:28,773 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=1010898.0, ans=0.125 2023-06-24 03:23:30,903 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1010898.0, ans=0.125 2023-06-24 03:24:04,108 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=1010958.0, ans=0.0 2023-06-24 03:24:23,700 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1011018.0, ans=0.125 2023-06-24 03:24:52,141 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=8.72 vs. limit=15.0 2023-06-24 03:24:55,885 INFO [train.py:996] (3/4) Epoch 6, batch 16050, loss[loss=0.2203, simple_loss=0.3028, pruned_loss=0.06892, over 20734.00 frames. ], tot_loss[loss=0.2169, simple_loss=0.295, pruned_loss=0.06942, over 4261186.40 frames. 
], batch size: 607, lr: 5.09e-03, grad_scale: 16.0 2023-06-24 03:25:02,630 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.897e+02 2.499e+02 2.877e+02 3.627e+02 5.675e+02, threshold=5.753e+02, percent-clipped=3.0 2023-06-24 03:25:08,671 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=1011138.0, ans=0.125 2023-06-24 03:25:27,178 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=1011198.0, ans=0.035 2023-06-24 03:25:59,468 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=1011318.0, ans=0.0 2023-06-24 03:26:42,146 INFO [train.py:996] (3/4) Epoch 6, batch 16100, loss[loss=0.2438, simple_loss=0.3113, pruned_loss=0.08821, over 21760.00 frames. ], tot_loss[loss=0.2204, simple_loss=0.2994, pruned_loss=0.07072, over 4271106.97 frames. ], batch size: 389, lr: 5.09e-03, grad_scale: 16.0 2023-06-24 03:26:46,310 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1011438.0, ans=0.125 2023-06-24 03:27:05,559 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1011498.0, ans=0.125 2023-06-24 03:27:05,989 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.65 vs. limit=10.0 2023-06-24 03:28:01,570 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=8.30 vs. limit=15.0 2023-06-24 03:28:03,240 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.36 vs. limit=15.0 2023-06-24 03:28:31,502 INFO [train.py:996] (3/4) Epoch 6, batch 16150, loss[loss=0.2762, simple_loss=0.3282, pruned_loss=0.1121, over 21767.00 frames. ], tot_loss[loss=0.2244, simple_loss=0.3019, pruned_loss=0.07342, over 4277198.26 frames. ], batch size: 508, lr: 5.08e-03, grad_scale: 16.0 2023-06-24 03:28:38,459 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.889e+02 2.535e+02 2.977e+02 3.474e+02 6.271e+02, threshold=5.955e+02, percent-clipped=2.0 2023-06-24 03:29:08,854 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1011798.0, ans=0.125 2023-06-24 03:29:21,752 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=1011858.0, ans=0.125 2023-06-24 03:29:21,871 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-24 03:29:53,387 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=1011918.0, ans=0.0 2023-06-24 03:30:21,156 INFO [train.py:996] (3/4) Epoch 6, batch 16200, loss[loss=0.2586, simple_loss=0.333, pruned_loss=0.09212, over 21322.00 frames. ], tot_loss[loss=0.2277, simple_loss=0.3054, pruned_loss=0.07495, over 4282538.81 frames. 
], batch size: 159, lr: 5.08e-03, grad_scale: 16.0 2023-06-24 03:30:28,610 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1012038.0, ans=0.0 2023-06-24 03:30:39,233 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=1012098.0, ans=0.0 2023-06-24 03:32:09,575 INFO [train.py:996] (3/4) Epoch 6, batch 16250, loss[loss=0.1526, simple_loss=0.2205, pruned_loss=0.0424, over 21789.00 frames. ], tot_loss[loss=0.2278, simple_loss=0.3042, pruned_loss=0.07569, over 4274112.73 frames. ], batch size: 102, lr: 5.08e-03, grad_scale: 16.0 2023-06-24 03:32:10,046 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1012338.0, ans=0.125 2023-06-24 03:32:13,267 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1012338.0, ans=0.125 2023-06-24 03:32:16,272 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.059e+02 2.579e+02 2.975e+02 3.411e+02 5.928e+02, threshold=5.950e+02, percent-clipped=0.0 2023-06-24 03:33:34,702 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.78 vs. limit=12.0 2023-06-24 03:33:40,414 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.33 vs. limit=15.0 2023-06-24 03:33:57,753 INFO [train.py:996] (3/4) Epoch 6, batch 16300, loss[loss=0.1886, simple_loss=0.2819, pruned_loss=0.04765, over 21445.00 frames. ], tot_loss[loss=0.2202, simple_loss=0.2977, pruned_loss=0.07134, over 4275917.40 frames. ], batch size: 211, lr: 5.08e-03, grad_scale: 16.0 2023-06-24 03:34:08,112 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.09 vs. limit=15.0 2023-06-24 03:34:33,592 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-24 03:34:42,747 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1012758.0, ans=0.125 2023-06-24 03:35:11,073 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1012818.0, ans=0.0 2023-06-24 03:35:47,986 INFO [train.py:996] (3/4) Epoch 6, batch 16350, loss[loss=0.192, simple_loss=0.2808, pruned_loss=0.05154, over 20799.00 frames. ], tot_loss[loss=0.2204, simple_loss=0.2962, pruned_loss=0.0723, over 4266306.52 frames. 
], batch size: 609, lr: 5.08e-03, grad_scale: 16.0 2023-06-24 03:36:00,043 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.811e+02 2.290e+02 2.661e+02 3.043e+02 4.876e+02, threshold=5.321e+02, percent-clipped=0.0 2023-06-24 03:36:45,384 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1013058.0, ans=0.125 2023-06-24 03:37:09,114 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=1013118.0, ans=0.2 2023-06-24 03:37:12,514 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1013118.0, ans=0.0 2023-06-24 03:37:29,884 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1013178.0, ans=0.125 2023-06-24 03:37:36,549 INFO [train.py:996] (3/4) Epoch 6, batch 16400, loss[loss=0.1939, simple_loss=0.2731, pruned_loss=0.05733, over 21808.00 frames. ], tot_loss[loss=0.2228, simple_loss=0.2992, pruned_loss=0.07319, over 4267081.65 frames. ], batch size: 282, lr: 5.08e-03, grad_scale: 32.0 2023-06-24 03:38:08,647 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=1013298.0, ans=0.015 2023-06-24 03:38:15,793 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1013298.0, ans=0.0 2023-06-24 03:38:22,737 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1013358.0, ans=0.1 2023-06-24 03:38:34,947 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=1013358.0, ans=10.0 2023-06-24 03:38:38,362 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=1013358.0, ans=0.0 2023-06-24 03:38:43,627 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1013418.0, ans=0.1 2023-06-24 03:39:08,260 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1013478.0, ans=0.125 2023-06-24 03:39:30,133 INFO [train.py:996] (3/4) Epoch 6, batch 16450, loss[loss=0.2084, simple_loss=0.2801, pruned_loss=0.06831, over 21747.00 frames. ], tot_loss[loss=0.2245, simple_loss=0.3, pruned_loss=0.07444, over 4269781.51 frames. ], batch size: 247, lr: 5.08e-03, grad_scale: 32.0 2023-06-24 03:39:42,779 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.970e+02 2.477e+02 2.722e+02 3.151e+02 4.827e+02, threshold=5.443e+02, percent-clipped=0.0 2023-06-24 03:39:54,320 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.87 vs. limit=15.0 2023-06-24 03:41:26,347 INFO [train.py:996] (3/4) Epoch 6, batch 16500, loss[loss=0.1653, simple_loss=0.2229, pruned_loss=0.05386, over 21821.00 frames. ], tot_loss[loss=0.2244, simple_loss=0.2993, pruned_loss=0.07477, over 4271878.72 frames. 
], batch size: 124, lr: 5.08e-03, grad_scale: 32.0 2023-06-24 03:41:28,725 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=1013838.0, ans=0.125 2023-06-24 03:42:13,836 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1013958.0, ans=0.1 2023-06-24 03:42:23,841 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=1013958.0, ans=0.125 2023-06-24 03:42:49,807 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=1014018.0, ans=0.125 2023-06-24 03:43:15,510 INFO [train.py:996] (3/4) Epoch 6, batch 16550, loss[loss=0.2375, simple_loss=0.3262, pruned_loss=0.07441, over 21728.00 frames. ], tot_loss[loss=0.222, simple_loss=0.2977, pruned_loss=0.07308, over 4273459.10 frames. ], batch size: 441, lr: 5.08e-03, grad_scale: 32.0 2023-06-24 03:43:22,448 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.907e+02 2.591e+02 3.154e+02 3.856e+02 7.253e+02, threshold=6.309e+02, percent-clipped=4.0 2023-06-24 03:44:02,358 INFO [scaling.py:962] (3/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=4.00 vs. limit=5.0 2023-06-24 03:44:08,681 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1014258.0, ans=0.125 2023-06-24 03:44:17,283 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1014258.0, ans=0.1 2023-06-24 03:44:57,090 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=1014378.0, ans=0.2 2023-06-24 03:45:06,850 INFO [train.py:996] (3/4) Epoch 6, batch 16600, loss[loss=0.3656, simple_loss=0.4748, pruned_loss=0.1282, over 19794.00 frames. ], tot_loss[loss=0.2292, simple_loss=0.306, pruned_loss=0.07622, over 4275689.65 frames. ], batch size: 702, lr: 5.08e-03, grad_scale: 32.0 2023-06-24 03:46:27,532 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1014618.0, ans=0.125 2023-06-24 03:46:43,689 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=1014678.0, ans=0.5 2023-06-24 03:46:59,105 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1014678.0, ans=0.125 2023-06-24 03:47:02,039 INFO [train.py:996] (3/4) Epoch 6, batch 16650, loss[loss=0.251, simple_loss=0.3258, pruned_loss=0.08814, over 21366.00 frames. ], tot_loss[loss=0.2371, simple_loss=0.316, pruned_loss=0.07907, over 4280396.86 frames. ], batch size: 549, lr: 5.08e-03, grad_scale: 32.0 2023-06-24 03:47:14,439 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.134e+02 2.632e+02 2.959e+02 3.254e+02 5.416e+02, threshold=5.917e+02, percent-clipped=0.0 2023-06-24 03:47:15,077 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-24 03:48:59,617 INFO [train.py:996] (3/4) Epoch 6, batch 16700, loss[loss=0.2391, simple_loss=0.3475, pruned_loss=0.06535, over 20764.00 frames. ], tot_loss[loss=0.2385, simple_loss=0.318, pruned_loss=0.07948, over 4281806.86 frames. 
], batch size: 607, lr: 5.08e-03, grad_scale: 32.0 2023-06-24 03:49:30,794 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=1015098.0, ans=0.125 2023-06-24 03:49:32,352 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=1015098.0, ans=0.035 2023-06-24 03:50:23,907 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1015218.0, ans=0.1 2023-06-24 03:50:26,417 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.05 vs. limit=6.0 2023-06-24 03:50:55,336 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten.whitening_limit, batch_count=1015278.0, ans=15.0 2023-06-24 03:50:57,213 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=1015338.0, ans=0.0 2023-06-24 03:50:58,284 INFO [train.py:996] (3/4) Epoch 6, batch 16750, loss[loss=0.261, simple_loss=0.3404, pruned_loss=0.09081, over 21795.00 frames. ], tot_loss[loss=0.2421, simple_loss=0.3203, pruned_loss=0.08197, over 4279236.75 frames. ], batch size: 124, lr: 5.08e-03, grad_scale: 16.0 2023-06-24 03:51:13,349 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.212e+02 2.841e+02 3.113e+02 3.878e+02 5.035e+02, threshold=6.225e+02, percent-clipped=0.0 2023-06-24 03:51:20,814 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.37 vs. limit=15.0 2023-06-24 03:52:41,974 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1015578.0, ans=0.1 2023-06-24 03:52:55,205 INFO [train.py:996] (3/4) Epoch 6, batch 16800, loss[loss=0.2667, simple_loss=0.3394, pruned_loss=0.09697, over 21767.00 frames. ], tot_loss[loss=0.2434, simple_loss=0.3233, pruned_loss=0.08175, over 4276170.18 frames. ], batch size: 441, lr: 5.08e-03, grad_scale: 32.0 2023-06-24 03:53:02,797 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1015638.0, ans=0.125 2023-06-24 03:53:45,127 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1015758.0, ans=0.125 2023-06-24 03:54:10,041 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=17.61 vs. limit=22.5 2023-06-24 03:54:17,513 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.51 vs. limit=15.0 2023-06-24 03:54:29,574 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1015878.0, ans=0.1 2023-06-24 03:54:44,497 INFO [train.py:996] (3/4) Epoch 6, batch 16850, loss[loss=0.2683, simple_loss=0.3172, pruned_loss=0.1097, over 21844.00 frames. ], tot_loss[loss=0.2413, simple_loss=0.3196, pruned_loss=0.08151, over 4283222.64 frames. 
], batch size: 508, lr: 5.07e-03, grad_scale: 32.0 2023-06-24 03:54:53,525 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.099e+02 2.780e+02 3.302e+02 4.313e+02 7.428e+02, threshold=6.605e+02, percent-clipped=4.0 2023-06-24 03:55:12,333 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1015998.0, ans=0.125 2023-06-24 03:55:14,257 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1015998.0, ans=0.125 2023-06-24 03:55:22,487 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1015998.0, ans=0.125 2023-06-24 03:55:51,122 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=1016118.0, ans=0.125 2023-06-24 03:55:51,259 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=1016118.0, ans=0.125 2023-06-24 03:55:53,596 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=6.53 vs. limit=15.0 2023-06-24 03:56:32,150 INFO [train.py:996] (3/4) Epoch 6, batch 16900, loss[loss=0.188, simple_loss=0.2728, pruned_loss=0.0516, over 21612.00 frames. ], tot_loss[loss=0.2383, simple_loss=0.315, pruned_loss=0.08076, over 4291736.26 frames. ], batch size: 263, lr: 5.07e-03, grad_scale: 32.0 2023-06-24 03:56:41,454 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1016238.0, ans=0.125 2023-06-24 03:56:51,587 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1016298.0, ans=0.1 2023-06-24 03:58:19,450 INFO [train.py:996] (3/4) Epoch 6, batch 16950, loss[loss=0.2059, simple_loss=0.2675, pruned_loss=0.07217, over 21184.00 frames. ], tot_loss[loss=0.2328, simple_loss=0.3076, pruned_loss=0.07904, over 4288345.09 frames. ], batch size: 608, lr: 5.07e-03, grad_scale: 16.0 2023-06-24 03:58:28,415 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=1016538.0, ans=0.125 2023-06-24 03:58:29,698 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.847e+02 2.437e+02 2.853e+02 3.182e+02 4.700e+02, threshold=5.707e+02, percent-clipped=0.0 2023-06-24 03:58:33,860 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1016538.0, ans=0.125 2023-06-24 03:58:37,813 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.12 vs. limit=22.5 2023-06-24 03:58:57,253 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1016598.0, ans=0.125 2023-06-24 03:59:35,206 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=1016718.0, ans=0.2 2023-06-24 03:59:36,962 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1016718.0, ans=0.125 2023-06-24 04:00:03,691 INFO [train.py:996] (3/4) Epoch 6, batch 17000, loss[loss=0.226, simple_loss=0.2931, pruned_loss=0.07947, over 21914.00 frames. 
], tot_loss[loss=0.231, simple_loss=0.3039, pruned_loss=0.07906, over 4286997.71 frames. ], batch size: 351, lr: 5.07e-03, grad_scale: 16.0 2023-06-24 04:00:36,782 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=1016898.0, ans=0.2 2023-06-24 04:00:57,750 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1016958.0, ans=0.125 2023-06-24 04:01:26,089 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1017018.0, ans=0.125 2023-06-24 04:01:27,845 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=1017018.0, ans=0.0 2023-06-24 04:01:54,098 INFO [train.py:996] (3/4) Epoch 6, batch 17050, loss[loss=0.2524, simple_loss=0.3408, pruned_loss=0.08198, over 21841.00 frames. ], tot_loss[loss=0.2368, simple_loss=0.3107, pruned_loss=0.0814, over 4288801.47 frames. ], batch size: 351, lr: 5.07e-03, grad_scale: 16.0 2023-06-24 04:01:54,730 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1017138.0, ans=0.1 2023-06-24 04:02:01,646 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1017138.0, ans=0.125 2023-06-24 04:02:04,583 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.138e+02 2.608e+02 3.012e+02 3.512e+02 5.895e+02, threshold=6.025e+02, percent-clipped=1.0 2023-06-24 04:02:05,881 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.53 vs. limit=15.0 2023-06-24 04:02:28,033 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=1017198.0, ans=0.0 2023-06-24 04:02:33,565 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1017258.0, ans=0.0 2023-06-24 04:02:45,135 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=1017258.0, ans=10.0 2023-06-24 04:03:09,147 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1017318.0, ans=0.1 2023-06-24 04:03:30,088 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=1017378.0, ans=0.125 2023-06-24 04:03:36,125 INFO [train.py:996] (3/4) Epoch 6, batch 17100, loss[loss=0.2121, simple_loss=0.281, pruned_loss=0.07164, over 21923.00 frames. ], tot_loss[loss=0.2376, simple_loss=0.3106, pruned_loss=0.0823, over 4296334.56 frames. ], batch size: 316, lr: 5.07e-03, grad_scale: 16.0 2023-06-24 04:04:13,369 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1017498.0, ans=0.1 2023-06-24 04:05:23,947 INFO [train.py:996] (3/4) Epoch 6, batch 17150, loss[loss=0.1998, simple_loss=0.2872, pruned_loss=0.05615, over 21781.00 frames. ], tot_loss[loss=0.2353, simple_loss=0.3068, pruned_loss=0.08189, over 4302202.14 frames. 
], batch size: 351, lr: 5.07e-03, grad_scale: 16.0 2023-06-24 04:05:37,059 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1017738.0, ans=0.125 2023-06-24 04:05:44,964 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.872e+02 2.638e+02 2.899e+02 3.354e+02 4.965e+02, threshold=5.799e+02, percent-clipped=0.0 2023-06-24 04:06:44,648 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1017918.0, ans=0.125 2023-06-24 04:07:02,595 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1017978.0, ans=0.0 2023-06-24 04:07:11,586 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1018038.0, ans=0.125 2023-06-24 04:07:17,749 INFO [train.py:996] (3/4) Epoch 6, batch 17200, loss[loss=0.235, simple_loss=0.3126, pruned_loss=0.07867, over 21586.00 frames. ], tot_loss[loss=0.2355, simple_loss=0.3076, pruned_loss=0.08174, over 4296839.80 frames. ], batch size: 263, lr: 5.07e-03, grad_scale: 32.0 2023-06-24 04:07:45,376 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=1018098.0, ans=0.5 2023-06-24 04:07:47,510 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1018098.0, ans=0.125 2023-06-24 04:07:59,552 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1018098.0, ans=0.125 2023-06-24 04:09:11,606 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1018338.0, ans=0.0 2023-06-24 04:09:12,588 INFO [train.py:996] (3/4) Epoch 6, batch 17250, loss[loss=0.2338, simple_loss=0.3132, pruned_loss=0.07727, over 21714.00 frames. ], tot_loss[loss=0.239, simple_loss=0.3111, pruned_loss=0.08342, over 4295211.32 frames. ], batch size: 298, lr: 5.07e-03, grad_scale: 16.0 2023-06-24 04:09:25,160 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.067e+02 2.699e+02 3.105e+02 3.621e+02 5.993e+02, threshold=6.210e+02, percent-clipped=1.0 2023-06-24 04:09:36,589 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1018398.0, ans=0.1 2023-06-24 04:10:55,340 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=1018578.0, ans=0.125 2023-06-24 04:11:01,902 INFO [train.py:996] (3/4) Epoch 6, batch 17300, loss[loss=0.2464, simple_loss=0.3281, pruned_loss=0.08238, over 21736.00 frames. ], tot_loss[loss=0.2466, simple_loss=0.3198, pruned_loss=0.08672, over 4293979.66 frames. ], batch size: 247, lr: 5.07e-03, grad_scale: 16.0 2023-06-24 04:11:32,709 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1018698.0, ans=0.1 2023-06-24 04:11:37,151 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=15.02 vs. limit=22.5 2023-06-24 04:12:52,414 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=13.59 vs. 
limit=22.5 2023-06-24 04:12:53,439 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=1018878.0, ans=0.125 2023-06-24 04:12:58,391 INFO [train.py:996] (3/4) Epoch 6, batch 17350, loss[loss=0.2553, simple_loss=0.3479, pruned_loss=0.08133, over 21481.00 frames. ], tot_loss[loss=0.2459, simple_loss=0.3203, pruned_loss=0.08575, over 4287693.76 frames. ], batch size: 471, lr: 5.07e-03, grad_scale: 16.0 2023-06-24 04:13:16,104 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.921e+02 2.820e+02 3.152e+02 3.644e+02 6.101e+02, threshold=6.303e+02, percent-clipped=0.0 2023-06-24 04:13:25,630 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=1018998.0, ans=0.125 2023-06-24 04:13:41,458 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1019058.0, ans=0.0 2023-06-24 04:14:28,406 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1019178.0, ans=0.125 2023-06-24 04:14:54,376 INFO [train.py:996] (3/4) Epoch 6, batch 17400, loss[loss=0.2568, simple_loss=0.3552, pruned_loss=0.07923, over 21219.00 frames. ], tot_loss[loss=0.2405, simple_loss=0.3165, pruned_loss=0.08226, over 4283683.90 frames. ], batch size: 548, lr: 5.07e-03, grad_scale: 16.0 2023-06-24 04:15:05,742 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=1019238.0, ans=0.05 2023-06-24 04:15:12,759 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=1019298.0, ans=0.07 2023-06-24 04:15:18,910 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.whiten.whitening_limit, batch_count=1019298.0, ans=15.0 2023-06-24 04:15:59,289 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten.whitening_limit, batch_count=1019418.0, ans=15.0 2023-06-24 04:16:02,238 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1019418.0, ans=0.1 2023-06-24 04:16:28,890 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=1019478.0, ans=0.2 2023-06-24 04:16:44,089 INFO [train.py:996] (3/4) Epoch 6, batch 17450, loss[loss=0.193, simple_loss=0.2878, pruned_loss=0.04909, over 21722.00 frames. ], tot_loss[loss=0.2363, simple_loss=0.3126, pruned_loss=0.08002, over 4270688.75 frames. ], batch size: 332, lr: 5.07e-03, grad_scale: 16.0 2023-06-24 04:16:44,619 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=1019538.0, ans=0.0 2023-06-24 04:16:58,158 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.752e+02 2.361e+02 2.755e+02 3.366e+02 5.958e+02, threshold=5.511e+02, percent-clipped=0.0 2023-06-24 04:17:36,268 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1019658.0, ans=0.0 2023-06-24 04:18:30,642 INFO [train.py:996] (3/4) Epoch 6, batch 17500, loss[loss=0.2573, simple_loss=0.3225, pruned_loss=0.09601, over 21815.00 frames. ], tot_loss[loss=0.2305, simple_loss=0.307, pruned_loss=0.07703, over 4281710.04 frames. 
], batch size: 112, lr: 5.06e-03, grad_scale: 8.0 2023-06-24 04:18:42,222 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=5.96 vs. limit=12.0 2023-06-24 04:19:02,497 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=1019898.0, ans=0.125 2023-06-24 04:19:14,248 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=1019958.0, ans=0.125 2023-06-24 04:19:29,691 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.59 vs. limit=22.5 2023-06-24 04:20:03,071 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-24 04:20:15,383 INFO [train.py:996] (3/4) Epoch 6, batch 17550, loss[loss=0.2099, simple_loss=0.3008, pruned_loss=0.05948, over 21786.00 frames. ], tot_loss[loss=0.23, simple_loss=0.3075, pruned_loss=0.07624, over 4290363.36 frames. ], batch size: 332, lr: 5.06e-03, grad_scale: 8.0 2023-06-24 04:20:28,585 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=6.38 vs. limit=15.0 2023-06-24 04:20:28,780 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.730e+02 2.219e+02 2.535e+02 2.795e+02 4.245e+02, threshold=5.070e+02, percent-clipped=0.0 2023-06-24 04:20:31,130 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1020198.0, ans=0.125 2023-06-24 04:20:39,516 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=1020198.0, ans=0.125 2023-06-24 04:20:39,650 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=1020198.0, ans=0.0 2023-06-24 04:20:54,695 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=1020258.0, ans=0.0 2023-06-24 04:21:04,901 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1020258.0, ans=0.125 2023-06-24 04:21:27,419 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1020318.0, ans=0.125 2023-06-24 04:21:40,253 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=1020378.0, ans=0.125 2023-06-24 04:21:58,694 INFO [train.py:996] (3/4) Epoch 6, batch 17600, loss[loss=0.2469, simple_loss=0.3204, pruned_loss=0.08675, over 21737.00 frames. ], tot_loss[loss=0.2321, simple_loss=0.3109, pruned_loss=0.07666, over 4290449.08 frames. ], batch size: 298, lr: 5.06e-03, grad_scale: 16.0 2023-06-24 04:23:07,068 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.98 vs. limit=6.0 2023-06-24 04:23:20,261 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=1020618.0, ans=0.015 2023-06-24 04:23:48,336 INFO [train.py:996] (3/4) Epoch 6, batch 17650, loss[loss=0.1412, simple_loss=0.1817, pruned_loss=0.05033, over 15911.00 frames. ], tot_loss[loss=0.2298, simple_loss=0.3071, pruned_loss=0.07628, over 4276626.55 frames. 
], batch size: 61, lr: 5.06e-03, grad_scale: 16.0 2023-06-24 04:24:13,252 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.840e+02 2.481e+02 3.096e+02 4.210e+02 8.151e+02, threshold=6.192e+02, percent-clipped=13.0 2023-06-24 04:25:12,214 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=8.81 vs. limit=15.0 2023-06-24 04:25:42,604 INFO [train.py:996] (3/4) Epoch 6, batch 17700, loss[loss=0.1874, simple_loss=0.2672, pruned_loss=0.05381, over 21377.00 frames. ], tot_loss[loss=0.2249, simple_loss=0.3023, pruned_loss=0.0737, over 4276148.04 frames. ], batch size: 131, lr: 5.06e-03, grad_scale: 16.0 2023-06-24 04:25:50,268 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1021038.0, ans=0.0 2023-06-24 04:26:24,857 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1021098.0, ans=0.125 2023-06-24 04:26:26,773 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1021158.0, ans=0.1 2023-06-24 04:26:28,310 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=1021158.0, ans=0.07 2023-06-24 04:26:37,637 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=1021158.0, ans=0.025 2023-06-24 04:26:37,677 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1021158.0, ans=0.125 2023-06-24 04:26:39,074 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=1021158.0, ans=0.07 2023-06-24 04:27:01,935 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1021278.0, ans=0.125 2023-06-24 04:27:07,236 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1021278.0, ans=0.0 2023-06-24 04:27:15,635 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1021278.0, ans=0.125 2023-06-24 04:27:30,941 INFO [train.py:996] (3/4) Epoch 6, batch 17750, loss[loss=0.2463, simple_loss=0.3341, pruned_loss=0.0793, over 21457.00 frames. ], tot_loss[loss=0.2316, simple_loss=0.3091, pruned_loss=0.07705, over 4281424.40 frames. ], batch size: 131, lr: 5.06e-03, grad_scale: 16.0 2023-06-24 04:27:44,839 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.800e+02 2.598e+02 3.053e+02 3.567e+02 5.587e+02, threshold=6.107e+02, percent-clipped=0.0 2023-06-24 04:29:03,150 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=1021578.0, ans=0.0 2023-06-24 04:29:20,577 INFO [train.py:996] (3/4) Epoch 6, batch 17800, loss[loss=0.2294, simple_loss=0.306, pruned_loss=0.0764, over 21409.00 frames. ], tot_loss[loss=0.2307, simple_loss=0.3088, pruned_loss=0.0763, over 4283337.42 frames. ], batch size: 131, lr: 5.06e-03, grad_scale: 16.0 2023-06-24 04:29:28,634 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.91 vs. 
limit=6.0 2023-06-24 04:29:46,063 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.95 vs. limit=15.0 2023-06-24 04:29:49,056 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=1021698.0, ans=0.125 2023-06-24 04:29:50,670 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=1021698.0, ans=0.2 2023-06-24 04:30:23,364 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=12.76 vs. limit=22.5 2023-06-24 04:30:52,636 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1021878.0, ans=0.1 2023-06-24 04:31:20,112 INFO [train.py:996] (3/4) Epoch 6, batch 17850, loss[loss=0.233, simple_loss=0.2959, pruned_loss=0.08505, over 20055.00 frames. ], tot_loss[loss=0.2321, simple_loss=0.3101, pruned_loss=0.07707, over 4276455.20 frames. ], batch size: 704, lr: 5.06e-03, grad_scale: 8.0 2023-06-24 04:31:35,786 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.909e+02 2.586e+02 3.040e+02 3.727e+02 6.886e+02, threshold=6.079e+02, percent-clipped=3.0 2023-06-24 04:31:56,292 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1022058.0, ans=0.0 2023-06-24 04:32:22,322 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=1022118.0, ans=0.125 2023-06-24 04:32:22,934 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.67 vs. limit=15.0 2023-06-24 04:33:10,796 INFO [train.py:996] (3/4) Epoch 6, batch 17900, loss[loss=0.2115, simple_loss=0.3027, pruned_loss=0.06019, over 21288.00 frames. ], tot_loss[loss=0.2358, simple_loss=0.3147, pruned_loss=0.07841, over 4273647.57 frames. ], batch size: 176, lr: 5.06e-03, grad_scale: 8.0 2023-06-24 04:33:42,041 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1022298.0, ans=0.0 2023-06-24 04:34:42,397 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=1022478.0, ans=0.5 2023-06-24 04:35:01,081 INFO [train.py:996] (3/4) Epoch 6, batch 17950, loss[loss=0.203, simple_loss=0.2879, pruned_loss=0.05905, over 21681.00 frames. ], tot_loss[loss=0.2321, simple_loss=0.3141, pruned_loss=0.07507, over 4280962.92 frames. ], batch size: 247, lr: 5.06e-03, grad_scale: 8.0 2023-06-24 04:35:03,509 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1022538.0, ans=0.1 2023-06-24 04:35:04,044 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.86 vs. 
limit=22.5 2023-06-24 04:35:04,977 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1022538.0, ans=0.125 2023-06-24 04:35:13,544 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=1022538.0, ans=0.125 2023-06-24 04:35:16,322 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.647e+02 2.348e+02 2.616e+02 3.044e+02 5.736e+02, threshold=5.233e+02, percent-clipped=0.0 2023-06-24 04:35:29,411 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.99 vs. limit=6.0 2023-06-24 04:36:01,767 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=1022718.0, ans=0.2 2023-06-24 04:36:47,684 INFO [train.py:996] (3/4) Epoch 6, batch 18000, loss[loss=0.2004, simple_loss=0.2622, pruned_loss=0.06925, over 21443.00 frames. ], tot_loss[loss=0.2275, simple_loss=0.3064, pruned_loss=0.07432, over 4278257.92 frames. ], batch size: 195, lr: 5.06e-03, grad_scale: 16.0 2023-06-24 04:36:47,684 INFO [train.py:1019] (3/4) Computing validation loss 2023-06-24 04:37:05,821 INFO [train.py:1028] (3/4) Epoch 6, validation: loss=0.2648, simple_loss=0.3617, pruned_loss=0.08394, over 1796401.00 frames. 2023-06-24 04:37:05,822 INFO [train.py:1029] (3/4) Maximum memory allocated so far is 23366MB 2023-06-24 04:37:16,254 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.06 vs. limit=22.5 2023-06-24 04:38:55,651 INFO [train.py:996] (3/4) Epoch 6, batch 18050, loss[loss=0.2635, simple_loss=0.3226, pruned_loss=0.1022, over 21355.00 frames. ], tot_loss[loss=0.2234, simple_loss=0.3002, pruned_loss=0.07335, over 4271669.12 frames. ], batch size: 471, lr: 5.06e-03, grad_scale: 16.0 2023-06-24 04:39:01,736 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1023138.0, ans=0.125 2023-06-24 04:39:06,890 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1023138.0, ans=0.2 2023-06-24 04:39:22,856 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.791e+02 2.462e+02 2.761e+02 3.558e+02 5.314e+02, threshold=5.521e+02, percent-clipped=1.0 2023-06-24 04:39:29,024 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=1023198.0, ans=0.125 2023-06-24 04:40:32,628 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1023378.0, ans=0.0 2023-06-24 04:40:36,112 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=1023378.0, ans=0.125 2023-06-24 04:40:41,799 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1023378.0, ans=0.0 2023-06-24 04:40:45,973 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.00 vs. limit=6.0 2023-06-24 04:40:46,685 INFO [train.py:996] (3/4) Epoch 6, batch 18100, loss[loss=0.2264, simple_loss=0.3099, pruned_loss=0.07148, over 21410.00 frames. 
], tot_loss[loss=0.2283, simple_loss=0.3052, pruned_loss=0.07567, over 4263559.53 frames. ], batch size: 194, lr: 5.06e-03, grad_scale: 16.0 2023-06-24 04:41:11,400 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1023438.0, ans=0.125 2023-06-24 04:41:36,845 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=12.55 vs. limit=15.0 2023-06-24 04:42:14,734 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1023618.0, ans=0.125 2023-06-24 04:42:42,387 INFO [train.py:996] (3/4) Epoch 6, batch 18150, loss[loss=0.2202, simple_loss=0.3033, pruned_loss=0.06848, over 21800.00 frames. ], tot_loss[loss=0.2297, simple_loss=0.3075, pruned_loss=0.076, over 4270794.12 frames. ], batch size: 282, lr: 5.05e-03, grad_scale: 16.0 2023-06-24 04:42:42,767 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=1023738.0, ans=0.0 2023-06-24 04:42:58,002 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1023738.0, ans=0.1 2023-06-24 04:43:02,496 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.917e+02 2.411e+02 2.816e+02 3.524e+02 6.086e+02, threshold=5.632e+02, percent-clipped=3.0 2023-06-24 04:43:36,519 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1023858.0, ans=0.0 2023-06-24 04:43:48,514 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=1023918.0, ans=0.025 2023-06-24 04:43:59,367 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.60 vs. limit=6.0 2023-06-24 04:44:08,525 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=1023978.0, ans=0.05 2023-06-24 04:44:22,783 INFO [train.py:996] (3/4) Epoch 6, batch 18200, loss[loss=0.1947, simple_loss=0.2636, pruned_loss=0.06289, over 21881.00 frames. ], tot_loss[loss=0.2266, simple_loss=0.3013, pruned_loss=0.0759, over 4277409.81 frames. ], batch size: 98, lr: 5.05e-03, grad_scale: 16.0 2023-06-24 04:44:51,873 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=1024098.0, ans=0.07 2023-06-24 04:45:36,402 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=1024218.0, ans=0.035 2023-06-24 04:45:46,500 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1024218.0, ans=0.1 2023-06-24 04:46:04,666 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=1024278.0, ans=0.04949747468305833 2023-06-24 04:46:07,511 INFO [train.py:996] (3/4) Epoch 6, batch 18250, loss[loss=0.1875, simple_loss=0.2553, pruned_loss=0.05984, over 21482.00 frames. ], tot_loss[loss=0.2189, simple_loss=0.2926, pruned_loss=0.07257, over 4276405.93 frames. 
], batch size: 194, lr: 5.05e-03, grad_scale: 16.0 2023-06-24 04:46:23,374 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.690e+02 2.233e+02 2.540e+02 3.083e+02 5.311e+02, threshold=5.080e+02, percent-clipped=0.0 2023-06-24 04:47:43,714 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=1024578.0, ans=0.1 2023-06-24 04:47:49,149 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=1024578.0, ans=0.0 2023-06-24 04:47:57,589 INFO [train.py:996] (3/4) Epoch 6, batch 18300, loss[loss=0.2172, simple_loss=0.2914, pruned_loss=0.07151, over 21346.00 frames. ], tot_loss[loss=0.2197, simple_loss=0.2936, pruned_loss=0.07289, over 4279669.09 frames. ], batch size: 176, lr: 5.05e-03, grad_scale: 16.0 2023-06-24 04:48:00,702 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.40 vs. limit=6.0 2023-06-24 04:48:56,708 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=1024758.0, ans=0.0 2023-06-24 04:49:05,489 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1024818.0, ans=0.125 2023-06-24 04:49:44,678 INFO [train.py:996] (3/4) Epoch 6, batch 18350, loss[loss=0.2304, simple_loss=0.3001, pruned_loss=0.08039, over 21339.00 frames. ], tot_loss[loss=0.2223, simple_loss=0.2996, pruned_loss=0.07252, over 4278298.13 frames. ], batch size: 471, lr: 5.05e-03, grad_scale: 16.0 2023-06-24 04:50:00,360 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.116e+02 2.650e+02 3.163e+02 4.128e+02 7.474e+02, threshold=6.326e+02, percent-clipped=9.0 2023-06-24 04:51:09,398 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1025118.0, ans=0.125 2023-06-24 04:51:34,314 INFO [train.py:996] (3/4) Epoch 6, batch 18400, loss[loss=0.1621, simple_loss=0.2646, pruned_loss=0.02977, over 20816.00 frames. ], tot_loss[loss=0.2186, simple_loss=0.2948, pruned_loss=0.07114, over 4277863.10 frames. ], batch size: 608, lr: 5.05e-03, grad_scale: 32.0 2023-06-24 04:52:19,570 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=1025358.0, ans=0.125 2023-06-24 04:53:15,377 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.04 vs. limit=12.0 2023-06-24 04:53:17,773 INFO [train.py:996] (3/4) Epoch 6, batch 18450, loss[loss=0.2089, simple_loss=0.2789, pruned_loss=0.06946, over 22013.00 frames. ], tot_loss[loss=0.2123, simple_loss=0.2901, pruned_loss=0.06732, over 4279719.33 frames. ], batch size: 103, lr: 5.05e-03, grad_scale: 32.0 2023-06-24 04:53:33,311 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.665e+02 2.125e+02 2.326e+02 2.659e+02 4.995e+02, threshold=4.653e+02, percent-clipped=0.0 2023-06-24 04:53:54,430 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.41 vs. 
limit=15.0 2023-06-24 04:54:29,150 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=1025718.0, ans=0.95 2023-06-24 04:54:44,844 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-24 04:54:55,180 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1025778.0, ans=0.125 2023-06-24 04:55:06,494 INFO [train.py:996] (3/4) Epoch 6, batch 18500, loss[loss=0.218, simple_loss=0.2839, pruned_loss=0.076, over 21974.00 frames. ], tot_loss[loss=0.2092, simple_loss=0.2855, pruned_loss=0.06643, over 4267134.62 frames. ], batch size: 103, lr: 5.05e-03, grad_scale: 32.0 2023-06-24 04:55:17,599 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1025838.0, ans=0.0 2023-06-24 04:55:37,484 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1025898.0, ans=0.125 2023-06-24 04:56:31,773 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=9.05 vs. limit=10.0 2023-06-24 04:56:52,796 INFO [train.py:996] (3/4) Epoch 6, batch 18550, loss[loss=0.1888, simple_loss=0.244, pruned_loss=0.06681, over 20704.00 frames. ], tot_loss[loss=0.2071, simple_loss=0.2832, pruned_loss=0.06555, over 4243636.06 frames. ], batch size: 608, lr: 5.05e-03, grad_scale: 32.0 2023-06-24 04:57:10,435 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.895e+02 2.481e+02 2.781e+02 3.235e+02 5.250e+02, threshold=5.562e+02, percent-clipped=2.0 2023-06-24 04:58:41,473 INFO [train.py:996] (3/4) Epoch 6, batch 18600, loss[loss=0.1975, simple_loss=0.271, pruned_loss=0.06202, over 21510.00 frames. ], tot_loss[loss=0.2081, simple_loss=0.2825, pruned_loss=0.06679, over 4244642.43 frames. ], batch size: 230, lr: 5.05e-03, grad_scale: 16.0 2023-06-24 04:58:58,347 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1026498.0, ans=0.125 2023-06-24 04:59:17,002 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1026498.0, ans=0.125 2023-06-24 04:59:31,797 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=1026558.0, ans=0.2 2023-06-24 05:00:01,194 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1026618.0, ans=0.1 2023-06-24 05:00:26,869 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1026678.0, ans=0.1 2023-06-24 05:00:29,907 INFO [train.py:996] (3/4) Epoch 6, batch 18650, loss[loss=0.217, simple_loss=0.28, pruned_loss=0.07697, over 20239.00 frames. ], tot_loss[loss=0.2089, simple_loss=0.2833, pruned_loss=0.06722, over 4254130.54 frames. 
], batch size: 702, lr: 5.05e-03, grad_scale: 16.0 2023-06-24 05:00:46,764 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.811e+02 2.410e+02 2.665e+02 3.233e+02 6.336e+02, threshold=5.330e+02, percent-clipped=1.0 2023-06-24 05:00:50,726 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1026798.0, ans=0.125 2023-06-24 05:00:57,685 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1026798.0, ans=0.1 2023-06-24 05:01:09,198 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.91 vs. limit=15.0 2023-06-24 05:01:12,081 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1026798.0, ans=0.125 2023-06-24 05:01:48,066 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=1026918.0, ans=0.0 2023-06-24 05:02:16,603 INFO [train.py:996] (3/4) Epoch 6, batch 18700, loss[loss=0.2077, simple_loss=0.2735, pruned_loss=0.07089, over 21818.00 frames. ], tot_loss[loss=0.2099, simple_loss=0.2816, pruned_loss=0.06909, over 4249723.98 frames. ], batch size: 298, lr: 5.05e-03, grad_scale: 16.0 2023-06-24 05:02:24,380 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.95 vs. limit=15.0 2023-06-24 05:02:32,230 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1027098.0, ans=0.125 2023-06-24 05:03:01,057 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=1027158.0, ans=0.04949747468305833 2023-06-24 05:04:03,121 INFO [train.py:996] (3/4) Epoch 6, batch 18750, loss[loss=0.2499, simple_loss=0.328, pruned_loss=0.08594, over 21654.00 frames. ], tot_loss[loss=0.2134, simple_loss=0.2838, pruned_loss=0.07149, over 4263586.11 frames. ], batch size: 263, lr: 5.05e-03, grad_scale: 8.0 2023-06-24 05:04:22,088 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.075e+02 2.422e+02 2.735e+02 3.202e+02 4.733e+02, threshold=5.471e+02, percent-clipped=0.0 2023-06-24 05:05:50,764 INFO [train.py:996] (3/4) Epoch 6, batch 18800, loss[loss=0.1908, simple_loss=0.2813, pruned_loss=0.05008, over 21657.00 frames. ], tot_loss[loss=0.2188, simple_loss=0.2914, pruned_loss=0.07315, over 4265274.13 frames. ], batch size: 247, lr: 5.05e-03, grad_scale: 16.0 2023-06-24 05:05:53,359 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=1027638.0, ans=0.0 2023-06-24 05:06:01,827 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1027638.0, ans=0.0 2023-06-24 05:06:02,412 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.35 vs. 
limit=10.0 2023-06-24 05:06:05,290 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1027638.0, ans=0.0 2023-06-24 05:06:12,624 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.35 vs. limit=22.5 2023-06-24 05:06:15,179 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.53 vs. limit=22.5 2023-06-24 05:06:51,064 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1027758.0, ans=0.0 2023-06-24 05:07:31,869 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1027878.0, ans=0.125 2023-06-24 05:07:38,448 INFO [train.py:996] (3/4) Epoch 6, batch 18850, loss[loss=0.2353, simple_loss=0.2956, pruned_loss=0.08753, over 20096.00 frames. ], tot_loss[loss=0.2126, simple_loss=0.287, pruned_loss=0.0691, over 4258024.90 frames. ], batch size: 702, lr: 5.04e-03, grad_scale: 16.0 2023-06-24 05:07:55,883 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=1027998.0, ans=0.2 2023-06-24 05:07:57,082 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.497e+02 2.192e+02 2.570e+02 2.921e+02 4.536e+02, threshold=5.140e+02, percent-clipped=0.0 2023-06-24 05:09:11,175 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1028178.0, ans=0.125 2023-06-24 05:09:12,944 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1028178.0, ans=0.0 2023-06-24 05:09:14,864 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1028178.0, ans=0.125 2023-06-24 05:09:26,035 INFO [train.py:996] (3/4) Epoch 6, batch 18900, loss[loss=0.2105, simple_loss=0.2705, pruned_loss=0.07523, over 21523.00 frames. ], tot_loss[loss=0.2107, simple_loss=0.2838, pruned_loss=0.06887, over 4265653.68 frames. ], batch size: 442, lr: 5.04e-03, grad_scale: 16.0 2023-06-24 05:09:39,161 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.00 vs. limit=22.5 2023-06-24 05:10:45,398 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1028418.0, ans=0.0 2023-06-24 05:10:46,918 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1028418.0, ans=0.125 2023-06-24 05:11:14,520 INFO [train.py:996] (3/4) Epoch 6, batch 18950, loss[loss=0.2459, simple_loss=0.3078, pruned_loss=0.092, over 21779.00 frames. ], tot_loss[loss=0.2142, simple_loss=0.2856, pruned_loss=0.07143, over 4271135.99 frames. 
], batch size: 441, lr: 5.04e-03, grad_scale: 16.0 2023-06-24 05:11:31,730 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=1028598.0, ans=0.2 2023-06-24 05:11:39,573 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.167e+02 2.683e+02 3.004e+02 3.629e+02 6.368e+02, threshold=6.008e+02, percent-clipped=2.0 2023-06-24 05:11:55,004 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=1028598.0, ans=0.125 2023-06-24 05:12:54,867 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1028778.0, ans=0.1 2023-06-24 05:13:05,427 INFO [train.py:996] (3/4) Epoch 6, batch 19000, loss[loss=0.2801, simple_loss=0.3502, pruned_loss=0.105, over 21337.00 frames. ], tot_loss[loss=0.2215, simple_loss=0.2957, pruned_loss=0.0736, over 4273190.37 frames. ], batch size: 143, lr: 5.04e-03, grad_scale: 16.0 2023-06-24 05:13:34,776 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1028898.0, ans=0.125 2023-06-24 05:14:43,872 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=1029078.0, ans=0.0 2023-06-24 05:14:54,034 INFO [train.py:996] (3/4) Epoch 6, batch 19050, loss[loss=0.2272, simple_loss=0.2923, pruned_loss=0.08098, over 21482.00 frames. ], tot_loss[loss=0.2259, simple_loss=0.2991, pruned_loss=0.07634, over 4273416.86 frames. ], batch size: 131, lr: 5.04e-03, grad_scale: 16.0 2023-06-24 05:15:19,305 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.099e+02 2.839e+02 3.291e+02 3.950e+02 6.159e+02, threshold=6.582e+02, percent-clipped=1.0 2023-06-24 05:15:20,937 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.20 vs. limit=12.0 2023-06-24 05:15:45,096 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1029258.0, ans=0.125 2023-06-24 05:16:00,115 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=1029258.0, ans=0.2 2023-06-24 05:16:38,767 INFO [train.py:996] (3/4) Epoch 6, batch 19100, loss[loss=0.2051, simple_loss=0.2703, pruned_loss=0.06995, over 21601.00 frames. ], tot_loss[loss=0.2271, simple_loss=0.2985, pruned_loss=0.0778, over 4282337.05 frames. ], batch size: 263, lr: 5.04e-03, grad_scale: 16.0 2023-06-24 05:16:39,418 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1029438.0, ans=0.125 2023-06-24 05:18:07,998 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1029618.0, ans=0.0 2023-06-24 05:18:08,499 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.41 vs. limit=15.0 2023-06-24 05:18:19,265 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.08 vs. limit=15.0 2023-06-24 05:18:36,314 INFO [train.py:996] (3/4) Epoch 6, batch 19150, loss[loss=0.2221, simple_loss=0.3006, pruned_loss=0.07186, over 21292.00 frames. 
], tot_loss[loss=0.2279, simple_loss=0.2992, pruned_loss=0.07828, over 4279015.13 frames. ], batch size: 159, lr: 5.04e-03, grad_scale: 16.0 2023-06-24 05:18:58,619 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-24 05:19:12,549 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.104e+02 2.501e+02 2.737e+02 3.196e+02 5.229e+02, threshold=5.475e+02, percent-clipped=0.0 2023-06-24 05:20:26,358 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=9.97 vs. limit=22.5 2023-06-24 05:20:37,944 INFO [train.py:996] (3/4) Epoch 6, batch 19200, loss[loss=0.227, simple_loss=0.3268, pruned_loss=0.06363, over 21250.00 frames. ], tot_loss[loss=0.2345, simple_loss=0.3104, pruned_loss=0.07926, over 4277558.07 frames. ], batch size: 143, lr: 5.04e-03, grad_scale: 16.0 2023-06-24 05:21:21,584 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=1030158.0, ans=0.0 2023-06-24 05:21:29,020 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=15.07 vs. limit=15.0 2023-06-24 05:21:31,812 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1030158.0, ans=0.125 2023-06-24 05:22:06,130 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=1030278.0, ans=0.2 2023-06-24 05:22:19,337 INFO [train.py:996] (3/4) Epoch 6, batch 19250, loss[loss=0.2074, simple_loss=0.2812, pruned_loss=0.06681, over 21296.00 frames. ], tot_loss[loss=0.2289, simple_loss=0.3096, pruned_loss=0.07408, over 4280068.52 frames. ], batch size: 143, lr: 5.04e-03, grad_scale: 16.0 2023-06-24 05:22:40,778 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1030338.0, ans=0.1 2023-06-24 05:22:50,489 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.504e+02 2.125e+02 2.467e+02 2.912e+02 4.275e+02, threshold=4.933e+02, percent-clipped=0.0 2023-06-24 05:22:51,041 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=1030398.0, ans=0.0 2023-06-24 05:22:53,036 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1030398.0, ans=0.1 2023-06-24 05:23:12,972 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.91 vs. limit=10.0 2023-06-24 05:24:12,794 INFO [train.py:996] (3/4) Epoch 6, batch 19300, loss[loss=0.1954, simple_loss=0.277, pruned_loss=0.05691, over 21801.00 frames. ], tot_loss[loss=0.2271, simple_loss=0.3064, pruned_loss=0.07395, over 4283052.27 frames. 
], batch size: 282, lr: 5.04e-03, grad_scale: 16.0 2023-06-24 05:24:53,556 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=1030758.0, ans=0.1 2023-06-24 05:25:06,362 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=1030758.0, ans=0.125 2023-06-24 05:25:11,649 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1030818.0, ans=0.0 2023-06-24 05:25:14,936 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1030818.0, ans=0.125 2023-06-24 05:25:59,637 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1030878.0, ans=0.1 2023-06-24 05:26:02,598 INFO [train.py:996] (3/4) Epoch 6, batch 19350, loss[loss=0.2237, simple_loss=0.3138, pruned_loss=0.06683, over 21720.00 frames. ], tot_loss[loss=0.2209, simple_loss=0.301, pruned_loss=0.07037, over 4275216.13 frames. ], batch size: 415, lr: 5.04e-03, grad_scale: 16.0 2023-06-24 05:26:10,255 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1030938.0, ans=0.125 2023-06-24 05:26:28,594 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.743e+02 2.277e+02 2.629e+02 3.333e+02 6.338e+02, threshold=5.259e+02, percent-clipped=7.0 2023-06-24 05:27:17,672 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1031118.0, ans=0.125 2023-06-24 05:27:50,209 INFO [train.py:996] (3/4) Epoch 6, batch 19400, loss[loss=0.2064, simple_loss=0.2702, pruned_loss=0.07132, over 21602.00 frames. ], tot_loss[loss=0.2199, simple_loss=0.3006, pruned_loss=0.06964, over 4272691.21 frames. ], batch size: 212, lr: 5.04e-03, grad_scale: 16.0 2023-06-24 05:28:06,794 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=1031238.0, ans=0.2 2023-06-24 05:28:15,662 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=1031298.0, ans=0.0 2023-06-24 05:28:36,945 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=1031358.0, ans=0.125 2023-06-24 05:28:47,806 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1031418.0, ans=0.1 2023-06-24 05:28:49,396 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=1031418.0, ans=0.0 2023-06-24 05:28:51,837 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.26 vs. limit=15.0 2023-06-24 05:29:01,824 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-24 05:29:44,550 INFO [train.py:996] (3/4) Epoch 6, batch 19450, loss[loss=0.236, simple_loss=0.3087, pruned_loss=0.08158, over 14822.00 frames. ], tot_loss[loss=0.2204, simple_loss=0.2977, pruned_loss=0.0716, over 4272219.85 frames. 
], batch size: 60, lr: 5.04e-03, grad_scale: 16.0 2023-06-24 05:30:05,471 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.971e+02 2.505e+02 2.907e+02 3.403e+02 7.011e+02, threshold=5.814e+02, percent-clipped=3.0 2023-06-24 05:30:48,509 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1031718.0, ans=0.125 2023-06-24 05:30:56,424 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=5.70 vs. limit=10.0 2023-06-24 05:31:14,308 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1031778.0, ans=0.125 2023-06-24 05:31:29,136 INFO [train.py:996] (3/4) Epoch 6, batch 19500, loss[loss=0.2284, simple_loss=0.3133, pruned_loss=0.07176, over 21161.00 frames. ], tot_loss[loss=0.2193, simple_loss=0.2935, pruned_loss=0.07255, over 4277556.40 frames. ], batch size: 548, lr: 5.04e-03, grad_scale: 16.0 2023-06-24 05:31:44,402 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1031838.0, ans=0.0 2023-06-24 05:33:17,760 INFO [train.py:996] (3/4) Epoch 6, batch 19550, loss[loss=0.1789, simple_loss=0.2437, pruned_loss=0.05709, over 21561.00 frames. ], tot_loss[loss=0.2146, simple_loss=0.2881, pruned_loss=0.07053, over 4281833.02 frames. ], batch size: 195, lr: 5.03e-03, grad_scale: 16.0 2023-06-24 05:33:21,610 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1032138.0, ans=0.125 2023-06-24 05:33:37,933 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.061e+02 2.721e+02 3.147e+02 3.714e+02 5.540e+02, threshold=6.293e+02, percent-clipped=0.0 2023-06-24 05:33:44,246 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=1032198.0, ans=0.2 2023-06-24 05:33:50,914 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1032258.0, ans=0.1 2023-06-24 05:34:50,681 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=2.99 vs. limit=12.0 2023-06-24 05:35:04,136 INFO [train.py:996] (3/4) Epoch 6, batch 19600, loss[loss=0.2687, simple_loss=0.3334, pruned_loss=0.102, over 21280.00 frames. ], tot_loss[loss=0.2157, simple_loss=0.289, pruned_loss=0.0712, over 4276278.84 frames. 
], batch size: 143, lr: 5.03e-03, grad_scale: 32.0 2023-06-24 05:35:16,864 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=1032438.0, ans=0.125 2023-06-24 05:35:23,988 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=1032498.0, ans=0.0 2023-06-24 05:35:29,768 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten.whitening_limit, batch_count=1032498.0, ans=15.0 2023-06-24 05:35:44,721 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1032558.0, ans=0.125 2023-06-24 05:35:57,357 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=1032558.0, ans=0.09899494936611666 2023-06-24 05:36:41,123 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1032678.0, ans=0.0 2023-06-24 05:36:52,249 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=1032738.0, ans=0.0 2023-06-24 05:36:53,247 INFO [train.py:996] (3/4) Epoch 6, batch 19650, loss[loss=0.2274, simple_loss=0.2968, pruned_loss=0.07902, over 21815.00 frames. ], tot_loss[loss=0.2216, simple_loss=0.294, pruned_loss=0.07462, over 4278145.70 frames. ], batch size: 351, lr: 5.03e-03, grad_scale: 16.0 2023-06-24 05:36:57,378 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=1032738.0, ans=0.04949747468305833 2023-06-24 05:37:16,219 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.065e+02 2.599e+02 2.881e+02 3.237e+02 5.731e+02, threshold=5.762e+02, percent-clipped=0.0 2023-06-24 05:37:20,529 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1032798.0, ans=0.125 2023-06-24 05:37:23,797 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1032798.0, ans=0.1 2023-06-24 05:38:11,261 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.03 vs. limit=12.0 2023-06-24 05:38:18,551 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=1032918.0, ans=0.2 2023-06-24 05:38:40,539 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1032978.0, ans=0.125 2023-06-24 05:38:45,139 INFO [train.py:996] (3/4) Epoch 6, batch 19700, loss[loss=0.2015, simple_loss=0.291, pruned_loss=0.05601, over 21682.00 frames. ], tot_loss[loss=0.2245, simple_loss=0.2983, pruned_loss=0.07528, over 4274452.62 frames. ], batch size: 298, lr: 5.03e-03, grad_scale: 16.0 2023-06-24 05:39:31,788 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1033158.0, ans=0.125 2023-06-24 05:40:32,475 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=1033278.0, ans=0.07 2023-06-24 05:40:35,420 INFO [train.py:996] (3/4) Epoch 6, batch 19750, loss[loss=0.2154, simple_loss=0.2947, pruned_loss=0.06808, over 21759.00 frames. 
], tot_loss[loss=0.2308, simple_loss=0.3078, pruned_loss=0.07691, over 4271035.41 frames. ], batch size: 124, lr: 5.03e-03, grad_scale: 8.0 2023-06-24 05:41:09,215 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.111e+02 2.723e+02 3.338e+02 4.190e+02 5.879e+02, threshold=6.676e+02, percent-clipped=1.0 2023-06-24 05:41:09,764 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1033398.0, ans=0.125 2023-06-24 05:41:30,442 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=1033458.0, ans=0.125 2023-06-24 05:41:35,560 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=1033458.0, ans=0.125 2023-06-24 05:42:22,737 INFO [train.py:996] (3/4) Epoch 6, batch 19800, loss[loss=0.2395, simple_loss=0.32, pruned_loss=0.0795, over 21484.00 frames. ], tot_loss[loss=0.2306, simple_loss=0.3064, pruned_loss=0.07741, over 4270081.33 frames. ], batch size: 471, lr: 5.03e-03, grad_scale: 8.0 2023-06-24 05:42:35,117 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.88 vs. limit=15.0 2023-06-24 05:42:41,571 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1033638.0, ans=0.125 2023-06-24 05:42:45,186 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=1033698.0, ans=0.2 2023-06-24 05:42:51,400 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1033698.0, ans=0.125 2023-06-24 05:43:41,428 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=1033818.0, ans=0.07 2023-06-24 05:43:57,719 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1033878.0, ans=0.125 2023-06-24 05:43:59,384 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1033878.0, ans=0.125 2023-06-24 05:44:13,570 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1033878.0, ans=0.125 2023-06-24 05:44:17,986 INFO [train.py:996] (3/4) Epoch 6, batch 19850, loss[loss=0.1782, simple_loss=0.2641, pruned_loss=0.04617, over 21717.00 frames. ], tot_loss[loss=0.2219, simple_loss=0.2985, pruned_loss=0.07264, over 4264997.55 frames. ], batch size: 332, lr: 5.03e-03, grad_scale: 8.0 2023-06-24 05:44:42,525 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1033998.0, ans=0.0 2023-06-24 05:44:43,049 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.50 vs. 
limit=22.5 2023-06-24 05:44:52,061 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.808e+02 2.320e+02 2.642e+02 2.979e+02 5.130e+02, threshold=5.285e+02, percent-clipped=0.0 2023-06-24 05:44:55,923 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=1033998.0, ans=0.125 2023-06-24 05:45:29,971 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=1034118.0, ans=0.2 2023-06-24 05:45:40,372 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1034178.0, ans=0.125 2023-06-24 05:45:44,953 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=1034178.0, ans=0.0 2023-06-24 05:46:03,636 INFO [train.py:996] (3/4) Epoch 6, batch 19900, loss[loss=0.2115, simple_loss=0.2812, pruned_loss=0.07089, over 21858.00 frames. ], tot_loss[loss=0.2195, simple_loss=0.2989, pruned_loss=0.07006, over 4259307.83 frames. ], batch size: 98, lr: 5.03e-03, grad_scale: 8.0 2023-06-24 05:46:19,714 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1034238.0, ans=0.125 2023-06-24 05:47:18,593 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=1034418.0, ans=0.125 2023-06-24 05:47:28,893 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=1034478.0, ans=0.125 2023-06-24 05:47:30,532 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1034478.0, ans=0.125 2023-06-24 05:47:58,273 INFO [train.py:996] (3/4) Epoch 6, batch 19950, loss[loss=0.1938, simple_loss=0.2584, pruned_loss=0.06457, over 21189.00 frames. ], tot_loss[loss=0.2158, simple_loss=0.2923, pruned_loss=0.0696, over 4259993.24 frames. ], batch size: 143, lr: 5.03e-03, grad_scale: 8.0 2023-06-24 05:48:30,914 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=1034598.0, ans=0.07 2023-06-24 05:48:33,572 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.795e+02 2.312e+02 2.767e+02 3.263e+02 6.271e+02, threshold=5.533e+02, percent-clipped=3.0 2023-06-24 05:49:09,149 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1034718.0, ans=0.1 2023-06-24 05:49:12,514 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=1034718.0, ans=0.0 2023-06-24 05:49:19,340 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1034778.0, ans=0.1 2023-06-24 05:49:46,356 INFO [train.py:996] (3/4) Epoch 6, batch 20000, loss[loss=0.2308, simple_loss=0.3141, pruned_loss=0.07374, over 21871.00 frames. ], tot_loss[loss=0.2165, simple_loss=0.2937, pruned_loss=0.06967, over 4259651.56 frames. 
], batch size: 351, lr: 5.03e-03, grad_scale: 16.0 2023-06-24 05:50:35,772 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=1034958.0, ans=0.125 2023-06-24 05:50:48,721 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.15 vs. limit=6.0 2023-06-24 05:51:33,366 INFO [train.py:996] (3/4) Epoch 6, batch 20050, loss[loss=0.2153, simple_loss=0.2877, pruned_loss=0.07146, over 21833.00 frames. ], tot_loss[loss=0.221, simple_loss=0.2967, pruned_loss=0.07258, over 4269053.30 frames. ], batch size: 282, lr: 5.03e-03, grad_scale: 16.0 2023-06-24 05:51:42,643 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1035138.0, ans=0.0 2023-06-24 05:51:51,381 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-24 05:52:03,943 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=1035198.0, ans=0.125 2023-06-24 05:52:07,050 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1035198.0, ans=0.0 2023-06-24 05:52:08,117 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.887e+02 2.657e+02 2.915e+02 3.243e+02 4.793e+02, threshold=5.831e+02, percent-clipped=0.0 2023-06-24 05:53:05,471 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=6.69 vs. limit=15.0 2023-06-24 05:53:16,981 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1035378.0, ans=0.0 2023-06-24 05:53:23,688 INFO [train.py:996] (3/4) Epoch 6, batch 20100, loss[loss=0.2261, simple_loss=0.322, pruned_loss=0.0651, over 21830.00 frames. ], tot_loss[loss=0.2247, simple_loss=0.2994, pruned_loss=0.07497, over 4280430.74 frames. ], batch size: 316, lr: 5.03e-03, grad_scale: 16.0 2023-06-24 05:54:05,331 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1035498.0, ans=0.1 2023-06-24 05:54:42,661 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=1035618.0, ans=0.07 2023-06-24 05:54:54,765 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=1035678.0, ans=0.0 2023-06-24 05:54:58,409 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=1035678.0, ans=0.0 2023-06-24 05:55:20,081 INFO [train.py:996] (3/4) Epoch 6, batch 20150, loss[loss=0.2443, simple_loss=0.3153, pruned_loss=0.08663, over 21665.00 frames. ], tot_loss[loss=0.2338, simple_loss=0.3101, pruned_loss=0.07877, over 4279370.57 frames. 
], batch size: 263, lr: 5.03e-03, grad_scale: 16.0 2023-06-24 05:55:46,196 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.227e+02 2.880e+02 3.455e+02 4.017e+02 7.640e+02, threshold=6.911e+02, percent-clipped=4.0 2023-06-24 05:55:56,249 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1035858.0, ans=0.125 2023-06-24 05:56:40,987 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.74 vs. limit=15.0 2023-06-24 05:57:01,451 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.63 vs. limit=10.0 2023-06-24 05:57:12,810 INFO [train.py:996] (3/4) Epoch 6, batch 20200, loss[loss=0.2531, simple_loss=0.3326, pruned_loss=0.08684, over 21786.00 frames. ], tot_loss[loss=0.2397, simple_loss=0.3161, pruned_loss=0.08165, over 4277811.05 frames. ], batch size: 332, lr: 5.02e-03, grad_scale: 16.0 2023-06-24 05:58:04,399 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1036158.0, ans=0.125 2023-06-24 05:58:25,867 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.29 vs. limit=15.0 2023-06-24 05:59:01,882 INFO [train.py:996] (3/4) Epoch 6, batch 20250, loss[loss=0.2229, simple_loss=0.2919, pruned_loss=0.07691, over 16493.00 frames. ], tot_loss[loss=0.2378, simple_loss=0.3161, pruned_loss=0.07971, over 4268655.77 frames. ], batch size: 60, lr: 5.02e-03, grad_scale: 16.0 2023-06-24 05:59:14,976 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1036338.0, ans=0.1 2023-06-24 05:59:26,776 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.892e+02 2.473e+02 2.856e+02 3.579e+02 8.091e+02, threshold=5.711e+02, percent-clipped=1.0 2023-06-24 05:59:38,281 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1036398.0, ans=0.125 2023-06-24 05:59:39,966 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1036398.0, ans=0.125 2023-06-24 06:00:49,967 INFO [train.py:996] (3/4) Epoch 6, batch 20300, loss[loss=0.2296, simple_loss=0.3282, pruned_loss=0.06548, over 21267.00 frames. ], tot_loss[loss=0.2332, simple_loss=0.3132, pruned_loss=0.07659, over 4267261.26 frames. ], batch size: 548, lr: 5.02e-03, grad_scale: 16.0 2023-06-24 06:02:02,985 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=1036818.0, ans=0.2 2023-06-24 06:02:27,040 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1036878.0, ans=0.125 2023-06-24 06:02:32,864 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.19 vs. limit=12.0 2023-06-24 06:02:33,237 INFO [train.py:996] (3/4) Epoch 6, batch 20350, loss[loss=0.2498, simple_loss=0.3212, pruned_loss=0.08923, over 21690.00 frames. ], tot_loss[loss=0.2333, simple_loss=0.3131, pruned_loss=0.07676, over 4263296.57 frames. 
], batch size: 389, lr: 5.02e-03, grad_scale: 16.0 2023-06-24 06:02:46,421 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.74 vs. limit=22.5 2023-06-24 06:02:56,825 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.914e+02 2.320e+02 2.555e+02 2.973e+02 6.061e+02, threshold=5.110e+02, percent-clipped=1.0 2023-06-24 06:03:03,155 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1036998.0, ans=0.0 2023-06-24 06:03:21,832 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1037058.0, ans=0.0 2023-06-24 06:03:35,738 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1037058.0, ans=0.125 2023-06-24 06:04:05,519 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1037178.0, ans=0.1 2023-06-24 06:04:06,021 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.74 vs. limit=10.0 2023-06-24 06:04:20,796 INFO [train.py:996] (3/4) Epoch 6, batch 20400, loss[loss=0.2479, simple_loss=0.3434, pruned_loss=0.07624, over 19801.00 frames. ], tot_loss[loss=0.2399, simple_loss=0.3177, pruned_loss=0.08111, over 4266525.68 frames. ], batch size: 704, lr: 5.02e-03, grad_scale: 32.0 2023-06-24 06:05:15,092 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=1037358.0, ans=0.0 2023-06-24 06:05:22,052 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=1037358.0, ans=0.05 2023-06-24 06:05:30,683 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=1037418.0, ans=0.2 2023-06-24 06:05:55,378 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1037478.0, ans=0.125 2023-06-24 06:06:08,173 INFO [train.py:996] (3/4) Epoch 6, batch 20450, loss[loss=0.2289, simple_loss=0.2991, pruned_loss=0.07934, over 21485.00 frames. ], tot_loss[loss=0.2409, simple_loss=0.3173, pruned_loss=0.0823, over 4245698.48 frames. ], batch size: 194, lr: 5.02e-03, grad_scale: 32.0 2023-06-24 06:06:09,255 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.10 vs. 
limit=22.5 2023-06-24 06:06:18,966 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1037538.0, ans=0.125 2023-06-24 06:06:31,822 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.242e+02 2.916e+02 3.328e+02 3.687e+02 5.878e+02, threshold=6.655e+02, percent-clipped=5.0 2023-06-24 06:06:34,372 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1037598.0, ans=0.125 2023-06-24 06:06:39,430 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=1037598.0, ans=0.0 2023-06-24 06:07:30,238 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1037718.0, ans=0.125 2023-06-24 06:07:44,476 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=1037778.0, ans=0.125 2023-06-24 06:07:44,952 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.77 vs. limit=15.0 2023-06-24 06:07:54,268 INFO [train.py:996] (3/4) Epoch 6, batch 20500, loss[loss=0.224, simple_loss=0.2891, pruned_loss=0.07945, over 21454.00 frames. ], tot_loss[loss=0.2386, simple_loss=0.3124, pruned_loss=0.08237, over 4250506.92 frames. ], batch size: 131, lr: 5.02e-03, grad_scale: 32.0 2023-06-24 06:09:12,934 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1038018.0, ans=0.0 2023-06-24 06:09:16,239 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1038018.0, ans=0.125 2023-06-24 06:09:17,889 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1038018.0, ans=0.0 2023-06-24 06:09:41,887 INFO [train.py:996] (3/4) Epoch 6, batch 20550, loss[loss=0.2288, simple_loss=0.3122, pruned_loss=0.07272, over 21847.00 frames. ], tot_loss[loss=0.2332, simple_loss=0.3054, pruned_loss=0.08054, over 4253218.04 frames. ], batch size: 372, lr: 5.02e-03, grad_scale: 32.0 2023-06-24 06:10:06,159 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.939e+02 2.636e+02 3.017e+02 3.648e+02 5.396e+02, threshold=6.035e+02, percent-clipped=0.0 2023-06-24 06:10:13,443 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=1038198.0, ans=10.0 2023-06-24 06:10:56,869 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=1038318.0, ans=0.125 2023-06-24 06:11:29,766 INFO [train.py:996] (3/4) Epoch 6, batch 20600, loss[loss=0.2204, simple_loss=0.2827, pruned_loss=0.07905, over 21360.00 frames. ], tot_loss[loss=0.2321, simple_loss=0.3069, pruned_loss=0.07864, over 4238422.09 frames. 
], batch size: 176, lr: 5.02e-03, grad_scale: 32.0 2023-06-24 06:11:59,729 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=1038498.0, ans=0.125 2023-06-24 06:12:24,123 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1038558.0, ans=0.125 2023-06-24 06:12:30,727 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=1038558.0, ans=0.0 2023-06-24 06:12:56,192 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=1038678.0, ans=0.2 2023-06-24 06:13:10,685 INFO [train.py:996] (3/4) Epoch 6, batch 20650, loss[loss=0.2011, simple_loss=0.2613, pruned_loss=0.07041, over 21223.00 frames. ], tot_loss[loss=0.2311, simple_loss=0.3037, pruned_loss=0.07921, over 4248513.33 frames. ], batch size: 176, lr: 5.02e-03, grad_scale: 32.0 2023-06-24 06:13:40,720 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.859e+02 2.453e+02 2.852e+02 3.486e+02 6.346e+02, threshold=5.704e+02, percent-clipped=1.0 2023-06-24 06:13:53,238 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=1038798.0, ans=0.0 2023-06-24 06:14:23,611 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=1038918.0, ans=0.07 2023-06-24 06:14:32,591 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=1038918.0, ans=0.125 2023-06-24 06:14:45,490 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.17 vs. limit=15.0 2023-06-24 06:15:00,367 INFO [train.py:996] (3/4) Epoch 6, batch 20700, loss[loss=0.1808, simple_loss=0.2558, pruned_loss=0.05291, over 21368.00 frames. ], tot_loss[loss=0.224, simple_loss=0.2964, pruned_loss=0.07578, over 4248824.37 frames. ], batch size: 194, lr: 5.02e-03, grad_scale: 16.0 2023-06-24 06:15:03,134 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1039038.0, ans=0.125 2023-06-24 06:15:04,713 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1039038.0, ans=0.125 2023-06-24 06:15:06,480 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1039038.0, ans=0.1 2023-06-24 06:16:49,940 INFO [train.py:996] (3/4) Epoch 6, batch 20750, loss[loss=0.2387, simple_loss=0.341, pruned_loss=0.06819, over 21697.00 frames. ], tot_loss[loss=0.2248, simple_loss=0.2991, pruned_loss=0.07525, over 4253593.65 frames. 
], batch size: 298, lr: 5.02e-03, grad_scale: 16.0 2023-06-24 06:17:37,397 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.847e+02 2.434e+02 2.945e+02 4.112e+02 9.661e+02, threshold=5.891e+02, percent-clipped=8.0 2023-06-24 06:17:39,725 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1039398.0, ans=0.0 2023-06-24 06:17:57,253 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=1039458.0, ans=0.0 2023-06-24 06:18:21,221 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=1039578.0, ans=0.0 2023-06-24 06:18:43,218 INFO [train.py:996] (3/4) Epoch 6, batch 20800, loss[loss=0.1923, simple_loss=0.2626, pruned_loss=0.06104, over 21664.00 frames. ], tot_loss[loss=0.2263, simple_loss=0.3015, pruned_loss=0.0756, over 4255830.98 frames. ], batch size: 333, lr: 5.02e-03, grad_scale: 32.0 2023-06-24 06:19:33,787 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.61 vs. limit=22.5 2023-06-24 06:19:40,301 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1039758.0, ans=0.125 2023-06-24 06:19:47,744 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=13.43 vs. limit=22.5 2023-06-24 06:19:48,967 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-24 06:20:14,464 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=1039878.0, ans=0.0 2023-06-24 06:20:26,333 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=1039878.0, ans=0.2 2023-06-24 06:20:29,171 INFO [train.py:996] (3/4) Epoch 6, batch 20850, loss[loss=0.1841, simple_loss=0.2484, pruned_loss=0.05995, over 16609.00 frames. ], tot_loss[loss=0.2201, simple_loss=0.2934, pruned_loss=0.07344, over 4251391.33 frames. ], batch size: 60, lr: 5.02e-03, grad_scale: 32.0 2023-06-24 06:20:32,961 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=1039938.0, ans=0.125 2023-06-24 06:21:06,159 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.891e+02 2.402e+02 2.795e+02 3.449e+02 6.931e+02, threshold=5.589e+02, percent-clipped=4.0 2023-06-24 06:21:18,016 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1039998.0, ans=0.0 2023-06-24 06:22:05,750 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.62 vs. limit=15.0 2023-06-24 06:22:18,911 INFO [train.py:996] (3/4) Epoch 6, batch 20900, loss[loss=0.218, simple_loss=0.294, pruned_loss=0.07102, over 21669.00 frames. ], tot_loss[loss=0.2216, simple_loss=0.2944, pruned_loss=0.0744, over 4256401.47 frames. 
], batch size: 263, lr: 5.01e-03, grad_scale: 32.0 2023-06-24 06:22:52,413 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=1040298.0, ans=0.95 2023-06-24 06:23:21,954 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2.whitening_limit, batch_count=1040358.0, ans=15.0 2023-06-24 06:23:38,623 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1040418.0, ans=0.2 2023-06-24 06:23:53,508 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1040478.0, ans=0.125 2023-06-24 06:24:04,672 INFO [train.py:996] (3/4) Epoch 6, batch 20950, loss[loss=0.2297, simple_loss=0.2926, pruned_loss=0.08338, over 21391.00 frames. ], tot_loss[loss=0.2171, simple_loss=0.2914, pruned_loss=0.07139, over 4263859.80 frames. ], batch size: 471, lr: 5.01e-03, grad_scale: 32.0 2023-06-24 06:24:40,106 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.703e+02 2.258e+02 2.758e+02 3.294e+02 6.843e+02, threshold=5.516e+02, percent-clipped=1.0 2023-06-24 06:24:42,909 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=9.44 vs. limit=15.0 2023-06-24 06:25:50,771 INFO [train.py:996] (3/4) Epoch 6, batch 21000, loss[loss=0.1707, simple_loss=0.2373, pruned_loss=0.05206, over 17030.00 frames. ], tot_loss[loss=0.2161, simple_loss=0.29, pruned_loss=0.07109, over 4271528.78 frames. ], batch size: 66, lr: 5.01e-03, grad_scale: 32.0 2023-06-24 06:25:50,772 INFO [train.py:1019] (3/4) Computing validation loss 2023-06-24 06:26:08,833 INFO [train.py:1028] (3/4) Epoch 6, validation: loss=0.2672, simple_loss=0.3654, pruned_loss=0.08451, over 1796401.00 frames. 2023-06-24 06:26:08,834 INFO [train.py:1029] (3/4) Maximum memory allocated so far is 23366MB 2023-06-24 06:26:40,057 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=1040898.0, ans=0.2 2023-06-24 06:26:40,182 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1040898.0, ans=0.1 2023-06-24 06:26:55,569 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1040958.0, ans=0.125 2023-06-24 06:27:15,948 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1041018.0, ans=0.125 2023-06-24 06:27:31,357 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1041078.0, ans=0.0 2023-06-24 06:27:50,623 INFO [train.py:996] (3/4) Epoch 6, batch 21050, loss[loss=0.1925, simple_loss=0.2548, pruned_loss=0.06508, over 21222.00 frames. ], tot_loss[loss=0.2154, simple_loss=0.2876, pruned_loss=0.07154, over 4276895.87 frames. 
], batch size: 548, lr: 5.01e-03, grad_scale: 16.0 2023-06-24 06:28:23,270 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.974e+02 2.469e+02 2.621e+02 3.007e+02 4.225e+02, threshold=5.242e+02, percent-clipped=0.0 2023-06-24 06:28:27,258 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1041198.0, ans=0.0 2023-06-24 06:28:41,657 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1041258.0, ans=0.125 2023-06-24 06:28:44,828 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=1041258.0, ans=0.125 2023-06-24 06:29:17,696 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=1041378.0, ans=0.125 2023-06-24 06:29:32,197 INFO [train.py:996] (3/4) Epoch 6, batch 21100, loss[loss=0.2069, simple_loss=0.2724, pruned_loss=0.07067, over 21725.00 frames. ], tot_loss[loss=0.2131, simple_loss=0.2843, pruned_loss=0.071, over 4272919.51 frames. ], batch size: 112, lr: 5.01e-03, grad_scale: 16.0 2023-06-24 06:29:32,846 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1041438.0, ans=0.125 2023-06-24 06:29:45,069 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1041438.0, ans=0.125 2023-06-24 06:30:05,980 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=1041498.0, ans=0.09899494936611666 2023-06-24 06:30:07,951 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.67 vs. limit=10.0 2023-06-24 06:30:09,288 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=1041498.0, ans=0.2 2023-06-24 06:30:44,166 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=1041618.0, ans=0.2 2023-06-24 06:31:20,307 INFO [train.py:996] (3/4) Epoch 6, batch 21150, loss[loss=0.2072, simple_loss=0.2829, pruned_loss=0.06573, over 15237.00 frames. ], tot_loss[loss=0.2112, simple_loss=0.2801, pruned_loss=0.07114, over 4260549.58 frames. ], batch size: 60, lr: 5.01e-03, grad_scale: 16.0 2023-06-24 06:32:03,766 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.773e+02 2.519e+02 2.928e+02 4.378e+02 7.241e+02, threshold=5.856e+02, percent-clipped=12.0 2023-06-24 06:32:06,739 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.54 vs. 
limit=15.0 2023-06-24 06:32:07,986 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=1041798.0, ans=0.2 2023-06-24 06:32:16,462 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=1041858.0, ans=0.07 2023-06-24 06:32:17,787 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=1041858.0, ans=0.025 2023-06-24 06:32:44,701 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=1041978.0, ans=0.125 2023-06-24 06:33:01,379 INFO [train.py:996] (3/4) Epoch 6, batch 21200, loss[loss=0.1783, simple_loss=0.2394, pruned_loss=0.05862, over 20731.00 frames. ], tot_loss[loss=0.2082, simple_loss=0.2765, pruned_loss=0.06995, over 4251723.93 frames. ], batch size: 608, lr: 5.01e-03, grad_scale: 16.0 2023-06-24 06:34:23,575 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1042218.0, ans=0.125 2023-06-24 06:34:45,427 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.27 vs. limit=12.0 2023-06-24 06:34:49,494 INFO [train.py:996] (3/4) Epoch 6, batch 21250, loss[loss=0.1946, simple_loss=0.2613, pruned_loss=0.06402, over 21697.00 frames. ], tot_loss[loss=0.2071, simple_loss=0.2746, pruned_loss=0.06983, over 4239955.24 frames. ], batch size: 124, lr: 5.01e-03, grad_scale: 16.0 2023-06-24 06:35:26,039 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-24 06:35:29,371 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1042398.0, ans=0.125 2023-06-24 06:35:33,817 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.037e+02 2.602e+02 2.917e+02 3.308e+02 4.858e+02, threshold=5.834e+02, percent-clipped=0.0 2023-06-24 06:35:52,491 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-24 06:36:30,898 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=2.72 vs. limit=15.0 2023-06-24 06:36:36,696 INFO [train.py:996] (3/4) Epoch 6, batch 21300, loss[loss=0.2319, simple_loss=0.3052, pruned_loss=0.07932, over 21867.00 frames. ], tot_loss[loss=0.2129, simple_loss=0.2819, pruned_loss=0.07194, over 4253242.74 frames. ], batch size: 351, lr: 5.01e-03, grad_scale: 16.0 2023-06-24 06:37:25,792 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1042698.0, ans=0.125 2023-06-24 06:37:33,133 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.06 vs. 
limit=6.0 2023-06-24 06:38:01,601 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1042818.0, ans=0.1 2023-06-24 06:38:11,643 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=1042878.0, ans=10.0 2023-06-24 06:38:28,061 INFO [train.py:996] (3/4) Epoch 6, batch 21350, loss[loss=0.2087, simple_loss=0.3014, pruned_loss=0.05798, over 21831.00 frames. ], tot_loss[loss=0.2167, simple_loss=0.2876, pruned_loss=0.0729, over 4270529.03 frames. ], batch size: 316, lr: 5.01e-03, grad_scale: 16.0 2023-06-24 06:38:57,876 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=1042998.0, ans=0.125 2023-06-24 06:39:13,495 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.957e+02 2.459e+02 2.698e+02 3.098e+02 4.551e+02, threshold=5.397e+02, percent-clipped=0.0 2023-06-24 06:40:27,125 INFO [train.py:996] (3/4) Epoch 6, batch 21400, loss[loss=0.2789, simple_loss=0.3476, pruned_loss=0.1051, over 21766.00 frames. ], tot_loss[loss=0.2177, simple_loss=0.2909, pruned_loss=0.07223, over 4275838.32 frames. ], batch size: 441, lr: 5.01e-03, grad_scale: 16.0 2023-06-24 06:40:34,482 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1043238.0, ans=0.0 2023-06-24 06:41:41,948 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.05 vs. limit=10.0 2023-06-24 06:41:54,601 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1043478.0, ans=0.1 2023-06-24 06:42:15,509 INFO [train.py:996] (3/4) Epoch 6, batch 21450, loss[loss=0.2656, simple_loss=0.3216, pruned_loss=0.1048, over 21719.00 frames. ], tot_loss[loss=0.2225, simple_loss=0.2952, pruned_loss=0.07488, over 4281269.16 frames. ], batch size: 473, lr: 5.01e-03, grad_scale: 16.0 2023-06-24 06:42:48,475 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1043598.0, ans=0.125 2023-06-24 06:42:49,430 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.972e+02 2.545e+02 3.012e+02 3.537e+02 6.506e+02, threshold=6.024e+02, percent-clipped=2.0 2023-06-24 06:44:00,959 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1043838.0, ans=0.125 2023-06-24 06:44:02,126 INFO [train.py:996] (3/4) Epoch 6, batch 21500, loss[loss=0.2134, simple_loss=0.2811, pruned_loss=0.07282, over 21296.00 frames. ], tot_loss[loss=0.2222, simple_loss=0.2931, pruned_loss=0.07564, over 4281943.88 frames. ], batch size: 144, lr: 5.01e-03, grad_scale: 16.0 2023-06-24 06:45:09,815 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-24 06:45:50,184 INFO [train.py:996] (3/4) Epoch 6, batch 21550, loss[loss=0.1623, simple_loss=0.2309, pruned_loss=0.04685, over 21264.00 frames. ], tot_loss[loss=0.216, simple_loss=0.2859, pruned_loss=0.07303, over 4279606.76 frames. 
], batch size: 159, lr: 5.01e-03, grad_scale: 8.0 2023-06-24 06:46:26,732 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.014e+02 2.541e+02 2.913e+02 3.487e+02 5.320e+02, threshold=5.826e+02, percent-clipped=0.0 2023-06-24 06:46:51,282 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.59 vs. limit=10.0 2023-06-24 06:46:53,942 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=1044318.0, ans=0.125 2023-06-24 06:47:39,510 INFO [train.py:996] (3/4) Epoch 6, batch 21600, loss[loss=0.209, simple_loss=0.3034, pruned_loss=0.05731, over 21897.00 frames. ], tot_loss[loss=0.2126, simple_loss=0.2825, pruned_loss=0.07136, over 4278251.99 frames. ], batch size: 372, lr: 5.00e-03, grad_scale: 16.0 2023-06-24 06:48:02,857 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.48 vs. limit=10.0 2023-06-24 06:48:13,278 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=12.39 vs. limit=15.0 2023-06-24 06:48:19,348 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=1044558.0, ans=0.2 2023-06-24 06:49:02,528 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=1044618.0, ans=0.0 2023-06-24 06:49:13,706 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.94 vs. limit=22.5 2023-06-24 06:49:14,349 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1044678.0, ans=0.1 2023-06-24 06:49:23,159 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1044678.0, ans=0.1 2023-06-24 06:49:27,624 INFO [train.py:996] (3/4) Epoch 6, batch 21650, loss[loss=0.1912, simple_loss=0.249, pruned_loss=0.06672, over 20718.00 frames. ], tot_loss[loss=0.2128, simple_loss=0.2864, pruned_loss=0.06957, over 4273902.82 frames. ], batch size: 607, lr: 5.00e-03, grad_scale: 16.0 2023-06-24 06:50:03,882 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.981e+02 2.515e+02 2.797e+02 3.244e+02 5.540e+02, threshold=5.595e+02, percent-clipped=0.0 2023-06-24 06:50:15,100 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1044858.0, ans=0.125 2023-06-24 06:50:25,415 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1044918.0, ans=0.125 2023-06-24 06:50:25,938 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.50 vs. 
limit=22.5 2023-06-24 06:50:49,191 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1044918.0, ans=0.125 2023-06-24 06:51:08,252 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=1044978.0, ans=0.2 2023-06-24 06:51:13,567 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1045038.0, ans=0.0 2023-06-24 06:51:14,467 INFO [train.py:996] (3/4) Epoch 6, batch 21700, loss[loss=0.1955, simple_loss=0.2582, pruned_loss=0.06643, over 21302.00 frames. ], tot_loss[loss=0.2104, simple_loss=0.2865, pruned_loss=0.06712, over 4277616.39 frames. ], batch size: 144, lr: 5.00e-03, grad_scale: 16.0 2023-06-24 06:51:16,944 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1045038.0, ans=0.125 2023-06-24 06:52:29,168 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=8.97 vs. limit=15.0 2023-06-24 06:53:01,224 INFO [train.py:996] (3/4) Epoch 6, batch 21750, loss[loss=0.1966, simple_loss=0.2677, pruned_loss=0.06275, over 21821.00 frames. ], tot_loss[loss=0.209, simple_loss=0.2828, pruned_loss=0.06759, over 4273363.10 frames. ], batch size: 107, lr: 5.00e-03, grad_scale: 16.0 2023-06-24 06:53:05,308 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=1045338.0, ans=0.05 2023-06-24 06:53:07,390 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1045338.0, ans=0.0 2023-06-24 06:53:28,651 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.whiten.whitening_limit, batch_count=1045398.0, ans=12.0 2023-06-24 06:53:36,721 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1045398.0, ans=0.0 2023-06-24 06:53:37,703 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.726e+02 2.476e+02 2.744e+02 3.259e+02 4.826e+02, threshold=5.488e+02, percent-clipped=0.0 2023-06-24 06:53:45,893 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=9.69 vs. limit=15.0 2023-06-24 06:54:49,971 INFO [train.py:996] (3/4) Epoch 6, batch 21800, loss[loss=0.2501, simple_loss=0.3325, pruned_loss=0.08387, over 21654.00 frames. ], tot_loss[loss=0.2096, simple_loss=0.2812, pruned_loss=0.06896, over 4272737.60 frames. ], batch size: 391, lr: 5.00e-03, grad_scale: 16.0 2023-06-24 06:55:17,712 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1045698.0, ans=0.125 2023-06-24 06:55:23,770 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=2.86 vs. limit=12.0 2023-06-24 06:55:23,837 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.70 vs. 
limit=15.0 2023-06-24 06:55:38,820 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1045758.0, ans=0.1 2023-06-24 06:56:39,664 INFO [train.py:996] (3/4) Epoch 6, batch 21850, loss[loss=0.2644, simple_loss=0.3319, pruned_loss=0.0985, over 21741.00 frames. ], tot_loss[loss=0.2123, simple_loss=0.2853, pruned_loss=0.06968, over 4274281.21 frames. ], batch size: 441, lr: 5.00e-03, grad_scale: 16.0 2023-06-24 06:56:46,044 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.71 vs. limit=15.0 2023-06-24 06:56:58,172 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=1045938.0, ans=0.0 2023-06-24 06:57:16,748 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.124e+02 2.510e+02 2.889e+02 3.463e+02 5.314e+02, threshold=5.778e+02, percent-clipped=0.0 2023-06-24 06:58:14,790 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1046178.0, ans=0.0 2023-06-24 06:58:14,931 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1046178.0, ans=0.0 2023-06-24 06:58:27,733 INFO [train.py:996] (3/4) Epoch 6, batch 21900, loss[loss=0.2074, simple_loss=0.2704, pruned_loss=0.0722, over 21708.00 frames. ], tot_loss[loss=0.2146, simple_loss=0.2878, pruned_loss=0.07068, over 4280146.02 frames. ], batch size: 264, lr: 5.00e-03, grad_scale: 16.0 2023-06-24 06:59:31,983 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.19 vs. limit=15.0 2023-06-24 07:00:21,989 INFO [train.py:996] (3/4) Epoch 6, batch 21950, loss[loss=0.1633, simple_loss=0.2349, pruned_loss=0.04582, over 21478.00 frames. ], tot_loss[loss=0.2118, simple_loss=0.283, pruned_loss=0.07027, over 4282214.35 frames. ], batch size: 195, lr: 5.00e-03, grad_scale: 16.0 2023-06-24 07:00:29,476 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1046538.0, ans=0.125 2023-06-24 07:00:53,155 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.690e+02 2.424e+02 2.913e+02 3.468e+02 5.833e+02, threshold=5.826e+02, percent-clipped=1.0 2023-06-24 07:01:22,717 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=1046718.0, ans=0.125 2023-06-24 07:01:57,681 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1046778.0, ans=0.125 2023-06-24 07:02:10,401 INFO [train.py:996] (3/4) Epoch 6, batch 22000, loss[loss=0.1742, simple_loss=0.24, pruned_loss=0.05423, over 21571.00 frames. ], tot_loss[loss=0.2054, simple_loss=0.2767, pruned_loss=0.06704, over 4279914.24 frames. 
], batch size: 263, lr: 5.00e-03, grad_scale: 32.0 2023-06-24 07:02:30,415 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1046898.0, ans=0.1 2023-06-24 07:03:05,657 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=1046958.0, ans=0.07 2023-06-24 07:03:09,731 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.49 vs. limit=15.0 2023-06-24 07:03:29,406 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=1047018.0, ans=0.2 2023-06-24 07:03:31,004 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1047018.0, ans=0.125 2023-06-24 07:04:00,833 INFO [train.py:996] (3/4) Epoch 6, batch 22050, loss[loss=0.271, simple_loss=0.3733, pruned_loss=0.08432, over 19947.00 frames. ], tot_loss[loss=0.2107, simple_loss=0.283, pruned_loss=0.06919, over 4269863.98 frames. ], batch size: 702, lr: 5.00e-03, grad_scale: 32.0 2023-06-24 07:04:22,872 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=1047198.0, ans=0.2 2023-06-24 07:04:26,237 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=1047198.0, ans=0.125 2023-06-24 07:04:30,134 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.20 vs. limit=12.0 2023-06-24 07:04:38,282 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=1047198.0, ans=0.0 2023-06-24 07:04:39,705 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.688e+02 2.375e+02 2.787e+02 3.407e+02 5.897e+02, threshold=5.574e+02, percent-clipped=1.0 2023-06-24 07:05:16,492 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=1047318.0, ans=0.125 2023-06-24 07:05:49,358 INFO [train.py:996] (3/4) Epoch 6, batch 22100, loss[loss=0.2361, simple_loss=0.3004, pruned_loss=0.08593, over 21640.00 frames. ], tot_loss[loss=0.2207, simple_loss=0.2937, pruned_loss=0.07388, over 4257771.49 frames. ], batch size: 263, lr: 5.00e-03, grad_scale: 16.0 2023-06-24 07:05:55,436 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=1047438.0, ans=0.0 2023-06-24 07:06:05,770 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=1047498.0, ans=0.0 2023-06-24 07:06:09,243 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1047498.0, ans=0.1 2023-06-24 07:07:20,321 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1047678.0, ans=0.125 2023-06-24 07:07:32,045 INFO [train.py:996] (3/4) Epoch 6, batch 22150, loss[loss=0.2344, simple_loss=0.3079, pruned_loss=0.08042, over 20716.00 frames. ], tot_loss[loss=0.2236, simple_loss=0.2962, pruned_loss=0.07546, over 4264706.46 frames. 
], batch size: 607, lr: 5.00e-03, grad_scale: 16.0 2023-06-24 07:07:55,934 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.45 vs. limit=6.0 2023-06-24 07:07:57,178 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1047798.0, ans=0.125 2023-06-24 07:08:09,573 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-24 07:08:10,536 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.155e+02 2.700e+02 3.228e+02 3.782e+02 5.741e+02, threshold=6.456e+02, percent-clipped=1.0 2023-06-24 07:08:53,264 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1047918.0, ans=0.1 2023-06-24 07:09:13,345 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.12 vs. limit=15.0 2023-06-24 07:09:17,346 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.95 vs. limit=15.0 2023-06-24 07:09:20,131 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=1048038.0, ans=0.125 2023-06-24 07:09:21,388 INFO [train.py:996] (3/4) Epoch 6, batch 22200, loss[loss=0.2868, simple_loss=0.4053, pruned_loss=0.08419, over 19776.00 frames. ], tot_loss[loss=0.2256, simple_loss=0.2981, pruned_loss=0.07653, over 4275050.60 frames. ], batch size: 702, lr: 5.00e-03, grad_scale: 16.0 2023-06-24 07:10:05,064 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1048158.0, ans=0.0 2023-06-24 07:10:48,939 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.03 vs. limit=15.0 2023-06-24 07:11:09,186 INFO [train.py:996] (3/4) Epoch 6, batch 22250, loss[loss=0.2712, simple_loss=0.3531, pruned_loss=0.09471, over 21623.00 frames. ], tot_loss[loss=0.2309, simple_loss=0.3062, pruned_loss=0.07783, over 4279837.99 frames. ], batch size: 414, lr: 5.00e-03, grad_scale: 16.0 2023-06-24 07:11:46,674 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.112e+02 2.521e+02 2.836e+02 3.368e+02 6.817e+02, threshold=5.671e+02, percent-clipped=1.0 2023-06-24 07:11:48,941 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=1048458.0, ans=0.0 2023-06-24 07:11:50,796 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=1048458.0, ans=0.2 2023-06-24 07:12:48,287 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=8.05 vs. limit=15.0 2023-06-24 07:12:55,394 INFO [train.py:996] (3/4) Epoch 6, batch 22300, loss[loss=0.2195, simple_loss=0.3011, pruned_loss=0.06896, over 21423.00 frames. ], tot_loss[loss=0.2339, simple_loss=0.3081, pruned_loss=0.07984, over 4274800.94 frames. 
], batch size: 131, lr: 4.99e-03, grad_scale: 16.0 2023-06-24 07:12:59,209 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=1048638.0, ans=0.125 2023-06-24 07:13:04,712 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=1048638.0, ans=0.0 2023-06-24 07:13:10,320 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.81 vs. limit=10.0 2023-06-24 07:14:38,222 INFO [train.py:996] (3/4) Epoch 6, batch 22350, loss[loss=0.2477, simple_loss=0.3136, pruned_loss=0.09088, over 21740.00 frames. ], tot_loss[loss=0.2332, simple_loss=0.3057, pruned_loss=0.08037, over 4281495.79 frames. ], batch size: 441, lr: 4.99e-03, grad_scale: 16.0 2023-06-24 07:14:42,471 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=1048938.0, ans=0.125 2023-06-24 07:14:49,269 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=1048938.0, ans=0.125 2023-06-24 07:15:01,035 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1048998.0, ans=0.0 2023-06-24 07:15:09,765 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1048998.0, ans=0.125 2023-06-24 07:15:11,169 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=1048998.0, ans=0.2 2023-06-24 07:15:15,675 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.270e+02 2.647e+02 2.993e+02 3.483e+02 5.422e+02, threshold=5.987e+02, percent-clipped=0.0 2023-06-24 07:15:23,358 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=1049058.0, ans=0.2 2023-06-24 07:15:24,978 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-24 07:15:33,637 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=1049058.0, ans=0.2 2023-06-24 07:15:59,763 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1049118.0, ans=0.125 2023-06-24 07:16:10,205 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1049178.0, ans=0.2 2023-06-24 07:16:17,297 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1049178.0, ans=0.0 2023-06-24 07:16:20,239 INFO [train.py:996] (3/4) Epoch 6, batch 22400, loss[loss=0.215, simple_loss=0.2832, pruned_loss=0.0734, over 21391.00 frames. ], tot_loss[loss=0.2276, simple_loss=0.3011, pruned_loss=0.07707, over 4273869.11 frames. ], batch size: 177, lr: 4.99e-03, grad_scale: 32.0 2023-06-24 07:17:16,691 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=1049358.0, ans=0.09899494936611666 2023-06-24 07:17:36,762 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=13.12 vs. 
limit=15.0 2023-06-24 07:18:07,111 INFO [train.py:996] (3/4) Epoch 6, batch 22450, loss[loss=0.2231, simple_loss=0.2818, pruned_loss=0.08218, over 21560.00 frames. ], tot_loss[loss=0.2247, simple_loss=0.2965, pruned_loss=0.07651, over 4273792.53 frames. ], batch size: 415, lr: 4.99e-03, grad_scale: 16.0 2023-06-24 07:18:28,180 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1049598.0, ans=0.125 2023-06-24 07:18:32,446 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.65 vs. limit=15.0 2023-06-24 07:18:52,752 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.816e+02 2.523e+02 2.860e+02 3.590e+02 5.659e+02, threshold=5.720e+02, percent-clipped=0.0 2023-06-24 07:18:53,739 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.44 vs. limit=15.0 2023-06-24 07:19:11,705 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1049658.0, ans=0.1 2023-06-24 07:19:50,736 INFO [train.py:996] (3/4) Epoch 6, batch 22500, loss[loss=0.2332, simple_loss=0.3254, pruned_loss=0.07048, over 21612.00 frames. ], tot_loss[loss=0.2208, simple_loss=0.2908, pruned_loss=0.07536, over 4279130.34 frames. ], batch size: 263, lr: 4.99e-03, grad_scale: 16.0 2023-06-24 07:20:17,326 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1049898.0, ans=0.1 2023-06-24 07:20:48,660 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1049958.0, ans=0.0 2023-06-24 07:20:49,267 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.50 vs. limit=15.0 2023-06-24 07:21:19,030 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=1050018.0, ans=0.125 2023-06-24 07:21:19,228 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1050018.0, ans=0.1 2023-06-24 07:21:26,206 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1050078.0, ans=0.125 2023-06-24 07:21:40,277 INFO [train.py:996] (3/4) Epoch 6, batch 22550, loss[loss=0.2243, simple_loss=0.3056, pruned_loss=0.07147, over 21658.00 frames. ], tot_loss[loss=0.2235, simple_loss=0.295, pruned_loss=0.076, over 4285889.30 frames. ], batch size: 263, lr: 4.99e-03, grad_scale: 16.0 2023-06-24 07:22:00,178 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=1050138.0, ans=0.2 2023-06-24 07:22:32,441 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.214e+02 2.690e+02 3.328e+02 4.292e+02 7.428e+02, threshold=6.656e+02, percent-clipped=5.0 2023-06-24 07:22:54,918 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=1050318.0, ans=0.2 2023-06-24 07:22:57,386 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.31 vs. 
limit=15.0 2023-06-24 07:23:10,820 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1050378.0, ans=0.0 2023-06-24 07:23:14,943 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.29 vs. limit=22.5 2023-06-24 07:23:30,365 INFO [train.py:996] (3/4) Epoch 6, batch 22600, loss[loss=0.1951, simple_loss=0.2502, pruned_loss=0.07002, over 21271.00 frames. ], tot_loss[loss=0.2261, simple_loss=0.298, pruned_loss=0.07708, over 4286071.52 frames. ], batch size: 176, lr: 4.99e-03, grad_scale: 16.0 2023-06-24 07:23:57,139 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1050498.0, ans=0.1 2023-06-24 07:24:44,948 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1050618.0, ans=0.125 2023-06-24 07:24:56,543 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=1050678.0, ans=0.2 2023-06-24 07:25:23,774 INFO [train.py:996] (3/4) Epoch 6, batch 22650, loss[loss=0.2102, simple_loss=0.2683, pruned_loss=0.07601, over 21136.00 frames. ], tot_loss[loss=0.2237, simple_loss=0.2943, pruned_loss=0.07655, over 4280785.95 frames. ], batch size: 159, lr: 4.99e-03, grad_scale: 16.0 2023-06-24 07:25:36,002 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1050738.0, ans=0.1 2023-06-24 07:26:03,142 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=1050798.0, ans=0.0 2023-06-24 07:26:07,805 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.240e+02 2.713e+02 2.934e+02 3.379e+02 4.768e+02, threshold=5.868e+02, percent-clipped=0.0 2023-06-24 07:26:22,001 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=1050858.0, ans=0.2 2023-06-24 07:26:22,678 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=12.98 vs. limit=15.0 2023-06-24 07:27:04,003 INFO [train.py:996] (3/4) Epoch 6, batch 22700, loss[loss=0.2725, simple_loss=0.2961, pruned_loss=0.1244, over 21517.00 frames. ], tot_loss[loss=0.2196, simple_loss=0.2881, pruned_loss=0.07555, over 4269217.69 frames. ], batch size: 512, lr: 4.99e-03, grad_scale: 16.0 2023-06-24 07:27:53,557 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=1051158.0, ans=0.0 2023-06-24 07:28:13,228 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.21 vs. limit=6.0 2023-06-24 07:28:56,562 INFO [train.py:996] (3/4) Epoch 6, batch 22750, loss[loss=0.2317, simple_loss=0.3076, pruned_loss=0.0779, over 21758.00 frames. ], tot_loss[loss=0.2223, simple_loss=0.2896, pruned_loss=0.07748, over 4275336.60 frames. 
], batch size: 113, lr: 4.99e-03, grad_scale: 16.0 2023-06-24 07:29:08,874 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=1051338.0, ans=0.015 2023-06-24 07:29:25,808 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=1051398.0, ans=0.0 2023-06-24 07:29:41,501 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.135e+02 2.658e+02 2.967e+02 3.229e+02 5.067e+02, threshold=5.933e+02, percent-clipped=0.0 2023-06-24 07:30:27,614 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=1051578.0, ans=0.2 2023-06-24 07:30:49,463 INFO [train.py:996] (3/4) Epoch 6, batch 22800, loss[loss=0.2249, simple_loss=0.3002, pruned_loss=0.0748, over 21482.00 frames. ], tot_loss[loss=0.2257, simple_loss=0.2924, pruned_loss=0.07946, over 4282489.40 frames. ], batch size: 177, lr: 4.99e-03, grad_scale: 32.0 2023-06-24 07:31:10,374 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=1051698.0, ans=0.05 2023-06-24 07:31:27,514 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.09 vs. limit=6.0 2023-06-24 07:32:31,112 INFO [train.py:996] (3/4) Epoch 6, batch 22850, loss[loss=0.2177, simple_loss=0.2804, pruned_loss=0.0775, over 21746.00 frames. ], tot_loss[loss=0.2225, simple_loss=0.2882, pruned_loss=0.07839, over 4281863.25 frames. ], batch size: 351, lr: 4.99e-03, grad_scale: 16.0 2023-06-24 07:33:14,541 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.102e+02 2.514e+02 2.935e+02 3.337e+02 4.796e+02, threshold=5.870e+02, percent-clipped=0.0 2023-06-24 07:33:22,372 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1052058.0, ans=0.0 2023-06-24 07:33:37,054 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=1052118.0, ans=0.0 2023-06-24 07:33:40,701 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1052118.0, ans=0.125 2023-06-24 07:34:22,982 INFO [train.py:996] (3/4) Epoch 6, batch 22900, loss[loss=0.2258, simple_loss=0.3246, pruned_loss=0.06346, over 21611.00 frames. ], tot_loss[loss=0.2236, simple_loss=0.2925, pruned_loss=0.07731, over 4282795.18 frames. ], batch size: 263, lr: 4.99e-03, grad_scale: 16.0 2023-06-24 07:34:23,685 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1052238.0, ans=0.125 2023-06-24 07:34:33,672 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.52 vs. limit=15.0 2023-06-24 07:35:30,977 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=1052418.0, ans=0.0 2023-06-24 07:35:58,871 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1052478.0, ans=0.125 2023-06-24 07:36:14,139 INFO [train.py:996] (3/4) Epoch 6, batch 22950, loss[loss=0.2366, simple_loss=0.3475, pruned_loss=0.06282, over 21374.00 frames. ], tot_loss[loss=0.2289, simple_loss=0.3072, pruned_loss=0.0753, over 4274811.63 frames. 
], batch size: 211, lr: 4.99e-03, grad_scale: 16.0 2023-06-24 07:36:37,826 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1052598.0, ans=0.1 2023-06-24 07:36:52,424 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.32 vs. limit=15.0 2023-06-24 07:36:55,162 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1052658.0, ans=0.1 2023-06-24 07:36:56,201 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.951e+02 2.392e+02 2.733e+02 3.196e+02 4.909e+02, threshold=5.466e+02, percent-clipped=0.0 2023-06-24 07:37:03,946 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1052658.0, ans=0.125 2023-06-24 07:37:17,624 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-24 07:37:40,277 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1052718.0, ans=0.1 2023-06-24 07:38:02,377 INFO [train.py:996] (3/4) Epoch 6, batch 23000, loss[loss=0.206, simple_loss=0.2853, pruned_loss=0.06339, over 21640.00 frames. ], tot_loss[loss=0.2265, simple_loss=0.3055, pruned_loss=0.07373, over 4280172.70 frames. ], batch size: 263, lr: 4.98e-03, grad_scale: 16.0 2023-06-24 07:38:54,663 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=1052958.0, ans=0.125 2023-06-24 07:39:00,166 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1052958.0, ans=0.1 2023-06-24 07:39:35,458 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=1053078.0, ans=0.04949747468305833 2023-06-24 07:39:58,073 INFO [train.py:996] (3/4) Epoch 6, batch 23050, loss[loss=0.2319, simple_loss=0.3043, pruned_loss=0.07975, over 21465.00 frames. ], tot_loss[loss=0.2289, simple_loss=0.3067, pruned_loss=0.07554, over 4280681.48 frames. ], batch size: 194, lr: 4.98e-03, grad_scale: 16.0 2023-06-24 07:39:59,316 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.46 vs. limit=6.0 2023-06-24 07:40:32,609 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=1053198.0, ans=0.0 2023-06-24 07:40:41,050 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.131e+02 2.622e+02 2.848e+02 3.330e+02 6.770e+02, threshold=5.696e+02, percent-clipped=1.0 2023-06-24 07:41:32,407 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1053378.0, ans=0.1 2023-06-24 07:41:42,098 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=1053378.0, ans=0.125 2023-06-24 07:41:48,439 INFO [train.py:996] (3/4) Epoch 6, batch 23100, loss[loss=0.1997, simple_loss=0.2643, pruned_loss=0.06755, over 21118.00 frames. ], tot_loss[loss=0.2267, simple_loss=0.3015, pruned_loss=0.07597, over 4275647.80 frames. 
], batch size: 159, lr: 4.98e-03, grad_scale: 16.0 2023-06-24 07:42:22,289 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.45 vs. limit=22.5 2023-06-24 07:42:27,128 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=1053558.0, ans=0.04949747468305833 2023-06-24 07:42:27,200 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1053558.0, ans=0.1 2023-06-24 07:43:08,885 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1053618.0, ans=0.1 2023-06-24 07:43:09,568 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=9.19 vs. limit=15.0 2023-06-24 07:43:31,897 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.89 vs. limit=15.0 2023-06-24 07:43:35,984 INFO [train.py:996] (3/4) Epoch 6, batch 23150, loss[loss=0.2215, simple_loss=0.2942, pruned_loss=0.07435, over 21620.00 frames. ], tot_loss[loss=0.2231, simple_loss=0.2955, pruned_loss=0.07529, over 4281993.26 frames. ], batch size: 389, lr: 4.98e-03, grad_scale: 16.0 2023-06-24 07:43:39,669 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1053738.0, ans=0.1 2023-06-24 07:43:41,837 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=1053738.0, ans=0.125 2023-06-24 07:44:16,352 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.069e+02 2.510e+02 2.867e+02 3.300e+02 5.681e+02, threshold=5.734e+02, percent-clipped=0.0 2023-06-24 07:45:09,584 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=1053978.0, ans=0.04949747468305833 2023-06-24 07:45:15,929 INFO [train.py:996] (3/4) Epoch 6, batch 23200, loss[loss=0.2681, simple_loss=0.3261, pruned_loss=0.1051, over 21631.00 frames. ], tot_loss[loss=0.2234, simple_loss=0.2946, pruned_loss=0.0761, over 4289116.45 frames. ], batch size: 471, lr: 4.98e-03, grad_scale: 32.0 2023-06-24 07:45:39,289 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1054098.0, ans=0.125 2023-06-24 07:46:25,539 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.98 vs. limit=6.0 2023-06-24 07:46:26,875 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.26 vs. limit=12.0 2023-06-24 07:47:02,628 INFO [train.py:996] (3/4) Epoch 6, batch 23250, loss[loss=0.2369, simple_loss=0.2918, pruned_loss=0.09095, over 21579.00 frames. ], tot_loss[loss=0.2254, simple_loss=0.2955, pruned_loss=0.0777, over 4284752.41 frames. ], batch size: 548, lr: 4.98e-03, grad_scale: 32.0 2023-06-24 07:47:36,551 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.79 vs. 
limit=15.0 2023-06-24 07:47:44,160 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1054398.0, ans=0.125 2023-06-24 07:47:46,394 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys.whitening_limit, batch_count=1054398.0, ans=6.0 2023-06-24 07:47:56,057 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.962e+02 2.641e+02 2.998e+02 3.541e+02 5.576e+02, threshold=5.996e+02, percent-clipped=0.0 2023-06-24 07:48:09,141 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1054458.0, ans=0.125 2023-06-24 07:48:22,045 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.78 vs. limit=10.0 2023-06-24 07:48:50,206 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=1054578.0, ans=0.0 2023-06-24 07:48:55,930 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.08 vs. limit=6.0 2023-06-24 07:48:57,998 INFO [train.py:996] (3/4) Epoch 6, batch 23300, loss[loss=0.2165, simple_loss=0.2695, pruned_loss=0.08178, over 20122.00 frames. ], tot_loss[loss=0.2321, simple_loss=0.3045, pruned_loss=0.07989, over 4286333.44 frames. ], batch size: 703, lr: 4.98e-03, grad_scale: 16.0 2023-06-24 07:49:06,646 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=3.41 vs. limit=12.0 2023-06-24 07:49:11,232 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=1054638.0, ans=0.035 2023-06-24 07:50:36,927 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1054878.0, ans=0.125 2023-06-24 07:50:46,417 INFO [train.py:996] (3/4) Epoch 6, batch 23350, loss[loss=0.2344, simple_loss=0.3036, pruned_loss=0.08259, over 20000.00 frames. ], tot_loss[loss=0.2325, simple_loss=0.3081, pruned_loss=0.07844, over 4272146.09 frames. ], batch size: 702, lr: 4.98e-03, grad_scale: 16.0 2023-06-24 07:51:41,617 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.696e+02 2.545e+02 3.075e+02 3.480e+02 4.848e+02, threshold=6.150e+02, percent-clipped=0.0 2023-06-24 07:52:34,874 INFO [train.py:996] (3/4) Epoch 6, batch 23400, loss[loss=0.2075, simple_loss=0.2798, pruned_loss=0.0676, over 21682.00 frames. ], tot_loss[loss=0.2264, simple_loss=0.3017, pruned_loss=0.07555, over 4268404.85 frames. ], batch size: 263, lr: 4.98e-03, grad_scale: 16.0 2023-06-24 07:54:01,560 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=1055418.0, ans=0.0 2023-06-24 07:54:33,571 INFO [train.py:996] (3/4) Epoch 6, batch 23450, loss[loss=0.2415, simple_loss=0.3058, pruned_loss=0.08862, over 21330.00 frames. ], tot_loss[loss=0.2292, simple_loss=0.3027, pruned_loss=0.07783, over 4278604.02 frames. 
], batch size: 548, lr: 4.98e-03, grad_scale: 16.0 2023-06-24 07:55:17,212 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.107e+02 2.530e+02 2.834e+02 3.227e+02 5.088e+02, threshold=5.668e+02, percent-clipped=0.0 2023-06-24 07:55:22,828 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1055658.0, ans=0.125 2023-06-24 07:55:33,561 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1055718.0, ans=0.125 2023-06-24 07:56:20,855 INFO [train.py:996] (3/4) Epoch 6, batch 23500, loss[loss=0.2383, simple_loss=0.3027, pruned_loss=0.08697, over 21805.00 frames. ], tot_loss[loss=0.2308, simple_loss=0.3025, pruned_loss=0.07955, over 4280831.68 frames. ], batch size: 414, lr: 4.98e-03, grad_scale: 16.0 2023-06-24 07:56:51,345 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1055898.0, ans=0.0 2023-06-24 07:56:52,900 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=1055898.0, ans=0.0 2023-06-24 07:56:59,817 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1055958.0, ans=0.125 2023-06-24 07:57:14,868 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1055958.0, ans=0.0 2023-06-24 07:57:41,987 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1056078.0, ans=0.1 2023-06-24 07:57:45,348 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1056078.0, ans=0.0 2023-06-24 07:58:09,306 INFO [train.py:996] (3/4) Epoch 6, batch 23550, loss[loss=0.2238, simple_loss=0.271, pruned_loss=0.08826, over 21647.00 frames. ], tot_loss[loss=0.2273, simple_loss=0.2967, pruned_loss=0.07899, over 4272615.42 frames. ], batch size: 416, lr: 4.98e-03, grad_scale: 16.0 2023-06-24 07:58:12,353 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.07 vs. limit=15.0 2023-06-24 07:58:18,705 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-24 07:58:39,007 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1056198.0, ans=0.125 2023-06-24 07:58:41,711 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=5.65 vs. limit=15.0 2023-06-24 07:58:52,551 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.092e+02 2.611e+02 2.905e+02 3.629e+02 5.861e+02, threshold=5.811e+02, percent-clipped=1.0 2023-06-24 07:59:10,592 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1056318.0, ans=0.0 2023-06-24 07:59:26,634 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=1056318.0, ans=0.125 2023-06-24 07:59:57,765 INFO [train.py:996] (3/4) Epoch 6, batch 23600, loss[loss=0.2472, simple_loss=0.3143, pruned_loss=0.09003, over 21373.00 frames. 
], tot_loss[loss=0.2266, simple_loss=0.2967, pruned_loss=0.07824, over 4274087.68 frames. ], batch size: 549, lr: 4.98e-03, grad_scale: 32.0 2023-06-24 08:00:33,463 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1056498.0, ans=0.125 2023-06-24 08:00:45,542 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1056558.0, ans=0.125 2023-06-24 08:01:51,989 INFO [train.py:996] (3/4) Epoch 6, batch 23650, loss[loss=0.2527, simple_loss=0.3309, pruned_loss=0.08731, over 21842.00 frames. ], tot_loss[loss=0.2262, simple_loss=0.2987, pruned_loss=0.07686, over 4274499.27 frames. ], batch size: 118, lr: 4.98e-03, grad_scale: 16.0 2023-06-24 08:02:13,740 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=1056798.0, ans=0.0 2023-06-24 08:02:38,196 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.907e+02 2.630e+02 3.092e+02 3.541e+02 6.593e+02, threshold=6.183e+02, percent-clipped=1.0 2023-06-24 08:03:01,529 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=1056918.0, ans=0.125 2023-06-24 08:03:11,831 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1056918.0, ans=0.125 2023-06-24 08:03:40,790 INFO [train.py:996] (3/4) Epoch 6, batch 23700, loss[loss=0.2333, simple_loss=0.3105, pruned_loss=0.07806, over 21794.00 frames. ], tot_loss[loss=0.2268, simple_loss=0.3004, pruned_loss=0.07657, over 4271565.24 frames. ], batch size: 124, lr: 4.97e-03, grad_scale: 16.0 2023-06-24 08:03:49,223 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.27 vs. limit=15.0 2023-06-24 08:04:08,110 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=1057098.0, ans=0.0 2023-06-24 08:04:09,819 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1057098.0, ans=0.125 2023-06-24 08:04:38,821 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-24 08:05:31,877 INFO [train.py:996] (3/4) Epoch 6, batch 23750, loss[loss=0.2016, simple_loss=0.2969, pruned_loss=0.0532, over 21738.00 frames. ], tot_loss[loss=0.2285, simple_loss=0.304, pruned_loss=0.07645, over 4271658.59 frames. 
], batch size: 298, lr: 4.97e-03, grad_scale: 16.0 2023-06-24 08:05:48,035 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=1057398.0, ans=0.2 2023-06-24 08:06:26,810 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.851e+02 2.305e+02 2.862e+02 3.715e+02 6.571e+02, threshold=5.724e+02, percent-clipped=1.0 2023-06-24 08:06:29,090 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1057458.0, ans=0.125 2023-06-24 08:06:51,009 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1057518.0, ans=0.1 2023-06-24 08:07:18,305 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=1057578.0, ans=0.09899494936611666 2023-06-24 08:07:21,353 INFO [train.py:996] (3/4) Epoch 6, batch 23800, loss[loss=0.2362, simple_loss=0.3124, pruned_loss=0.08001, over 21432.00 frames. ], tot_loss[loss=0.225, simple_loss=0.3016, pruned_loss=0.07419, over 4268239.31 frames. ], batch size: 194, lr: 4.97e-03, grad_scale: 16.0 2023-06-24 08:07:25,977 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1057638.0, ans=0.0 2023-06-24 08:07:41,479 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.81 vs. limit=6.0 2023-06-24 08:08:33,431 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1057758.0, ans=0.125 2023-06-24 08:09:18,085 INFO [train.py:996] (3/4) Epoch 6, batch 23850, loss[loss=0.2611, simple_loss=0.3327, pruned_loss=0.09474, over 21594.00 frames. ], tot_loss[loss=0.2304, simple_loss=0.309, pruned_loss=0.07593, over 4262009.61 frames. ], batch size: 389, lr: 4.97e-03, grad_scale: 16.0 2023-06-24 08:10:14,730 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.170e+02 2.790e+02 3.206e+02 3.794e+02 6.982e+02, threshold=6.412e+02, percent-clipped=2.0 2023-06-24 08:10:17,286 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1058058.0, ans=0.125 2023-06-24 08:10:24,783 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=13.83 vs. limit=22.5 2023-06-24 08:11:12,040 INFO [train.py:996] (3/4) Epoch 6, batch 23900, loss[loss=0.2355, simple_loss=0.3162, pruned_loss=0.07743, over 16320.00 frames. ], tot_loss[loss=0.237, simple_loss=0.3165, pruned_loss=0.07872, over 4247647.94 frames. 
], batch size: 60, lr: 4.97e-03, grad_scale: 16.0 2023-06-24 08:11:14,604 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1058238.0, ans=0.0 2023-06-24 08:11:23,342 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=1058238.0, ans=0.0 2023-06-24 08:11:40,180 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=1058298.0, ans=0.05 2023-06-24 08:12:03,663 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1058358.0, ans=0.0 2023-06-24 08:12:04,284 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.45 vs. limit=12.0 2023-06-24 08:12:21,206 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=1058418.0, ans=0.015 2023-06-24 08:12:53,692 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1058478.0, ans=0.0 2023-06-24 08:13:00,270 INFO [train.py:996] (3/4) Epoch 6, batch 23950, loss[loss=0.2031, simple_loss=0.2778, pruned_loss=0.0642, over 21744.00 frames. ], tot_loss[loss=0.2325, simple_loss=0.3091, pruned_loss=0.07793, over 4246265.98 frames. ], batch size: 282, lr: 4.97e-03, grad_scale: 16.0 2023-06-24 08:13:52,878 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.147e+02 2.675e+02 3.021e+02 3.458e+02 5.557e+02, threshold=6.041e+02, percent-clipped=0.0 2023-06-24 08:14:00,672 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1058658.0, ans=0.1 2023-06-24 08:14:51,112 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=1058778.0, ans=0.125 2023-06-24 08:14:55,868 INFO [train.py:996] (3/4) Epoch 6, batch 24000, loss[loss=0.2927, simple_loss=0.376, pruned_loss=0.1047, over 21780.00 frames. ], tot_loss[loss=0.2374, simple_loss=0.3115, pruned_loss=0.08166, over 4251190.33 frames. ], batch size: 118, lr: 4.97e-03, grad_scale: 32.0 2023-06-24 08:14:55,868 INFO [train.py:1019] (3/4) Computing validation loss 2023-06-24 08:15:17,154 INFO [train.py:1028] (3/4) Epoch 6, validation: loss=0.2634, simple_loss=0.3603, pruned_loss=0.08319, over 1796401.00 frames. 2023-06-24 08:15:17,156 INFO [train.py:1029] (3/4) Maximum memory allocated so far is 23366MB 2023-06-24 08:15:27,283 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1058838.0, ans=0.1 2023-06-24 08:16:23,826 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1059018.0, ans=0.125 2023-06-24 08:16:30,927 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1059018.0, ans=0.125 2023-06-24 08:16:33,211 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.00 vs. limit=15.0 2023-06-24 08:17:08,159 INFO [train.py:996] (3/4) Epoch 6, batch 24050, loss[loss=0.1942, simple_loss=0.281, pruned_loss=0.05366, over 21458.00 frames. 
], tot_loss[loss=0.2383, simple_loss=0.3128, pruned_loss=0.08191, over 4255511.38 frames. ], batch size: 194, lr: 4.97e-03, grad_scale: 16.0 2023-06-24 08:17:11,303 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=10.67 vs. limit=15.0 2023-06-24 08:17:33,639 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=1059198.0, ans=0.125 2023-06-24 08:17:48,106 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=1059258.0, ans=0.0 2023-06-24 08:17:56,748 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.988e+02 2.625e+02 3.028e+02 3.764e+02 6.671e+02, threshold=6.056e+02, percent-clipped=1.0 2023-06-24 08:18:37,903 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1059378.0, ans=0.125 2023-06-24 08:18:37,930 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-24 08:18:59,223 INFO [train.py:996] (3/4) Epoch 6, batch 24100, loss[loss=0.2759, simple_loss=0.3595, pruned_loss=0.09619, over 21731.00 frames. ], tot_loss[loss=0.2373, simple_loss=0.3136, pruned_loss=0.08054, over 4260923.60 frames. ], batch size: 441, lr: 4.97e-03, grad_scale: 16.0 2023-06-24 08:19:10,736 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1059438.0, ans=0.125 2023-06-24 08:19:26,363 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1059498.0, ans=0.1 2023-06-24 08:19:44,261 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1059558.0, ans=0.0 2023-06-24 08:20:11,632 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1059618.0, ans=0.1 2023-06-24 08:20:49,102 INFO [train.py:996] (3/4) Epoch 6, batch 24150, loss[loss=0.2628, simple_loss=0.315, pruned_loss=0.1053, over 21629.00 frames. ], tot_loss[loss=0.2377, simple_loss=0.312, pruned_loss=0.08163, over 4269859.55 frames. ], batch size: 471, lr: 4.97e-03, grad_scale: 16.0 2023-06-24 08:20:57,049 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=1059738.0, ans=0.025 2023-06-24 08:21:11,456 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=1059798.0, ans=0.0 2023-06-24 08:21:43,497 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.053e+02 2.678e+02 3.013e+02 3.443e+02 5.621e+02, threshold=6.026e+02, percent-clipped=0.0 2023-06-24 08:22:09,315 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=2.63 vs. limit=12.0 2023-06-24 08:22:12,075 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1059918.0, ans=0.125 2023-06-24 08:22:40,808 INFO [train.py:996] (3/4) Epoch 6, batch 24200, loss[loss=0.2249, simple_loss=0.3105, pruned_loss=0.06962, over 21770.00 frames. ], tot_loss[loss=0.2406, simple_loss=0.3143, pruned_loss=0.08341, over 4272322.76 frames. 
], batch size: 282, lr: 4.97e-03, grad_scale: 16.0 2023-06-24 08:22:43,673 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1060038.0, ans=0.0 2023-06-24 08:22:56,006 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1060038.0, ans=0.125 2023-06-24 08:23:16,118 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1060098.0, ans=0.125 2023-06-24 08:23:36,962 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1060158.0, ans=0.0 2023-06-24 08:24:27,613 INFO [train.py:996] (3/4) Epoch 6, batch 24250, loss[loss=0.2044, simple_loss=0.3168, pruned_loss=0.04607, over 21217.00 frames. ], tot_loss[loss=0.2325, simple_loss=0.3114, pruned_loss=0.07685, over 4272041.09 frames. ], batch size: 548, lr: 4.97e-03, grad_scale: 16.0 2023-06-24 08:24:56,352 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=1060398.0, ans=0.125 2023-06-24 08:24:57,061 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.29 vs. limit=10.0 2023-06-24 08:25:00,163 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=1060398.0, ans=0.04949747468305833 2023-06-24 08:25:22,804 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=1060458.0, ans=0.2 2023-06-24 08:25:25,856 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.522e+02 2.253e+02 2.770e+02 3.370e+02 5.813e+02, threshold=5.539e+02, percent-clipped=0.0 2023-06-24 08:25:55,698 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=1060518.0, ans=0.05 2023-06-24 08:26:15,733 INFO [train.py:996] (3/4) Epoch 6, batch 24300, loss[loss=0.2333, simple_loss=0.3086, pruned_loss=0.07897, over 21569.00 frames. ], tot_loss[loss=0.2221, simple_loss=0.3029, pruned_loss=0.07062, over 4275083.17 frames. ], batch size: 507, lr: 4.97e-03, grad_scale: 16.0 2023-06-24 08:27:41,493 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1060878.0, ans=0.125 2023-06-24 08:27:54,038 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=1060878.0, ans=0.05 2023-06-24 08:28:09,170 INFO [train.py:996] (3/4) Epoch 6, batch 24350, loss[loss=0.2632, simple_loss=0.3193, pruned_loss=0.1035, over 21721.00 frames. ], tot_loss[loss=0.2201, simple_loss=0.299, pruned_loss=0.07062, over 4275348.86 frames. ], batch size: 473, lr: 4.97e-03, grad_scale: 16.0 2023-06-24 08:28:10,066 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.43 vs. limit=12.0 2023-06-24 08:28:26,574 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.68 vs. 
limit=15.0 2023-06-24 08:29:01,797 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.594e+02 2.610e+02 2.946e+02 3.475e+02 5.631e+02, threshold=5.892e+02, percent-clipped=1.0 2023-06-24 08:29:58,768 INFO [train.py:996] (3/4) Epoch 6, batch 24400, loss[loss=0.2432, simple_loss=0.3159, pruned_loss=0.08528, over 21557.00 frames. ], tot_loss[loss=0.2265, simple_loss=0.3042, pruned_loss=0.07446, over 4272939.23 frames. ], batch size: 389, lr: 4.97e-03, grad_scale: 32.0 2023-06-24 08:30:22,183 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=1061298.0, ans=0.125 2023-06-24 08:30:53,546 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.19 vs. limit=22.5 2023-06-24 08:31:49,144 INFO [train.py:996] (3/4) Epoch 6, batch 24450, loss[loss=0.2635, simple_loss=0.3572, pruned_loss=0.08488, over 21642.00 frames. ], tot_loss[loss=0.2306, simple_loss=0.3083, pruned_loss=0.07649, over 4273562.33 frames. ], batch size: 441, lr: 4.96e-03, grad_scale: 32.0 2023-06-24 08:32:15,722 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1061598.0, ans=0.125 2023-06-24 08:32:38,922 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1061658.0, ans=0.125 2023-06-24 08:32:41,893 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.056e+02 2.780e+02 3.190e+02 3.668e+02 5.575e+02, threshold=6.380e+02, percent-clipped=0.0 2023-06-24 08:32:51,971 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=8.35 vs. limit=15.0 2023-06-24 08:33:25,829 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1061778.0, ans=0.125 2023-06-24 08:33:29,877 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.36 vs. limit=6.0 2023-06-24 08:33:37,482 INFO [train.py:996] (3/4) Epoch 6, batch 24500, loss[loss=0.2362, simple_loss=0.2962, pruned_loss=0.08811, over 20208.00 frames. ], tot_loss[loss=0.2303, simple_loss=0.3083, pruned_loss=0.07621, over 4281833.69 frames. ], batch size: 707, lr: 4.96e-03, grad_scale: 32.0 2023-06-24 08:33:38,198 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1061838.0, ans=0.125 2023-06-24 08:33:53,173 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.25 vs. limit=15.0 2023-06-24 08:34:53,827 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.01 vs. limit=10.0 2023-06-24 08:34:58,957 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1062018.0, ans=0.125 2023-06-24 08:35:28,560 INFO [train.py:996] (3/4) Epoch 6, batch 24550, loss[loss=0.2423, simple_loss=0.3232, pruned_loss=0.08068, over 21935.00 frames. ], tot_loss[loss=0.2329, simple_loss=0.3098, pruned_loss=0.07803, over 4284112.93 frames. 
], batch size: 372, lr: 4.96e-03, grad_scale: 16.0 2023-06-24 08:35:34,620 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-24 08:36:05,068 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=1062198.0, ans=0.2 2023-06-24 08:36:16,294 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.83 vs. limit=10.0 2023-06-24 08:36:18,657 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.031e+02 2.580e+02 2.942e+02 3.468e+02 6.882e+02, threshold=5.884e+02, percent-clipped=1.0 2023-06-24 08:36:59,531 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1062378.0, ans=0.125 2023-06-24 08:37:18,649 INFO [train.py:996] (3/4) Epoch 6, batch 24600, loss[loss=0.1969, simple_loss=0.259, pruned_loss=0.06739, over 21480.00 frames. ], tot_loss[loss=0.2316, simple_loss=0.306, pruned_loss=0.07858, over 4269576.00 frames. ], batch size: 132, lr: 4.96e-03, grad_scale: 16.0 2023-06-24 08:37:19,711 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=12.44 vs. limit=15.0 2023-06-24 08:37:21,059 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1062438.0, ans=0.1 2023-06-24 08:37:40,843 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1062498.0, ans=0.125 2023-06-24 08:37:42,927 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten.whitening_limit, batch_count=1062498.0, ans=15.0 2023-06-24 08:38:37,239 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1062618.0, ans=0.125 2023-06-24 08:38:58,620 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.30 vs. limit=15.0 2023-06-24 08:39:14,750 INFO [train.py:996] (3/4) Epoch 6, batch 24650, loss[loss=0.2023, simple_loss=0.264, pruned_loss=0.07029, over 21753.00 frames. ], tot_loss[loss=0.2259, simple_loss=0.2983, pruned_loss=0.07672, over 4267530.26 frames. ], batch size: 300, lr: 4.96e-03, grad_scale: 16.0 2023-06-24 08:39:37,993 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=1062798.0, ans=0.125 2023-06-24 08:39:44,779 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1062798.0, ans=0.125 2023-06-24 08:40:03,033 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.821e+02 2.707e+02 3.176e+02 3.617e+02 5.573e+02, threshold=6.353e+02, percent-clipped=0.0 2023-06-24 08:41:03,579 INFO [train.py:996] (3/4) Epoch 6, batch 24700, loss[loss=0.2386, simple_loss=0.3684, pruned_loss=0.05438, over 19894.00 frames. ], tot_loss[loss=0.2237, simple_loss=0.2969, pruned_loss=0.07524, over 4267713.74 frames. 
], batch size: 702, lr: 4.96e-03, grad_scale: 16.0 2023-06-24 08:41:41,320 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=1063158.0, ans=0.125 2023-06-24 08:42:51,626 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=1063338.0, ans=0.125 2023-06-24 08:42:52,505 INFO [train.py:996] (3/4) Epoch 6, batch 24750, loss[loss=0.1838, simple_loss=0.2428, pruned_loss=0.06239, over 21501.00 frames. ], tot_loss[loss=0.2184, simple_loss=0.2907, pruned_loss=0.07305, over 4270550.46 frames. ], batch size: 230, lr: 4.96e-03, grad_scale: 16.0 2023-06-24 08:43:17,577 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1063398.0, ans=0.125 2023-06-24 08:43:41,618 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.995e+02 2.438e+02 2.880e+02 3.643e+02 9.109e+02, threshold=5.760e+02, percent-clipped=1.0 2023-06-24 08:44:00,298 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1063518.0, ans=0.0 2023-06-24 08:44:36,338 INFO [train.py:996] (3/4) Epoch 6, batch 24800, loss[loss=0.2054, simple_loss=0.273, pruned_loss=0.0689, over 21910.00 frames. ], tot_loss[loss=0.2156, simple_loss=0.2856, pruned_loss=0.07285, over 4279983.87 frames. ], batch size: 316, lr: 4.96e-03, grad_scale: 32.0 2023-06-24 08:45:09,524 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=1063698.0, ans=0.125 2023-06-24 08:45:13,003 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1063698.0, ans=0.2 2023-06-24 08:46:01,391 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1063818.0, ans=0.125 2023-06-24 08:46:26,847 INFO [train.py:996] (3/4) Epoch 6, batch 24850, loss[loss=0.2205, simple_loss=0.2852, pruned_loss=0.0779, over 20211.00 frames. ], tot_loss[loss=0.2172, simple_loss=0.2862, pruned_loss=0.0741, over 4277214.88 frames. ], batch size: 702, lr: 4.96e-03, grad_scale: 32.0 2023-06-24 08:46:48,229 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.38 vs. limit=6.0 2023-06-24 08:47:10,030 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=1064058.0, ans=0.0 2023-06-24 08:47:21,021 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.206e+02 2.836e+02 3.370e+02 3.940e+02 7.201e+02, threshold=6.739e+02, percent-clipped=1.0 2023-06-24 08:47:32,445 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=1064058.0, ans=0.025 2023-06-24 08:47:32,496 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-24 08:48:09,487 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=1064178.0, ans=0.0 2023-06-24 08:48:21,627 INFO [train.py:996] (3/4) Epoch 6, batch 24900, loss[loss=0.2394, simple_loss=0.3124, pruned_loss=0.08317, over 21949.00 frames. ], tot_loss[loss=0.2207, simple_loss=0.2906, pruned_loss=0.0754, over 4283580.53 frames. 
], batch size: 316, lr: 4.96e-03, grad_scale: 32.0 2023-06-24 08:48:38,653 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1064298.0, ans=0.1 2023-06-24 08:49:05,022 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=1064358.0, ans=0.04949747468305833 2023-06-24 08:49:19,292 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1064358.0, ans=0.125 2023-06-24 08:50:14,170 INFO [train.py:996] (3/4) Epoch 6, batch 24950, loss[loss=0.3053, simple_loss=0.363, pruned_loss=0.1238, over 21405.00 frames. ], tot_loss[loss=0.2286, simple_loss=0.2989, pruned_loss=0.07919, over 4281139.63 frames. ], batch size: 471, lr: 4.96e-03, grad_scale: 16.0 2023-06-24 08:51:10,924 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-24 08:51:12,087 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.415e+02 2.867e+02 3.295e+02 3.992e+02 6.156e+02, threshold=6.590e+02, percent-clipped=0.0 2023-06-24 08:51:20,065 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=1064658.0, ans=0.09899494936611666 2023-06-24 08:52:06,475 INFO [train.py:996] (3/4) Epoch 6, batch 25000, loss[loss=0.2205, simple_loss=0.2845, pruned_loss=0.07823, over 21282.00 frames. ], tot_loss[loss=0.234, simple_loss=0.3055, pruned_loss=0.08127, over 4271430.73 frames. ], batch size: 549, lr: 4.96e-03, grad_scale: 16.0 2023-06-24 08:52:17,260 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-24 08:53:09,628 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1064958.0, ans=0.125 2023-06-24 08:53:54,480 INFO [train.py:996] (3/4) Epoch 6, batch 25050, loss[loss=0.2268, simple_loss=0.2777, pruned_loss=0.08794, over 21514.00 frames. ], tot_loss[loss=0.2279, simple_loss=0.2974, pruned_loss=0.07918, over 4271848.22 frames. ], batch size: 441, lr: 4.96e-03, grad_scale: 16.0 2023-06-24 08:54:56,057 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.014e+02 2.544e+02 2.890e+02 3.638e+02 5.399e+02, threshold=5.780e+02, percent-clipped=0.0 2023-06-24 08:54:56,580 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1065258.0, ans=0.0 2023-06-24 08:54:58,757 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1065258.0, ans=0.0 2023-06-24 08:55:06,434 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.74 vs. limit=22.5 2023-06-24 08:55:34,424 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1065378.0, ans=0.125 2023-06-24 08:55:39,919 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=1065378.0, ans=0.2 2023-06-24 08:55:44,174 INFO [train.py:996] (3/4) Epoch 6, batch 25100, loss[loss=0.2264, simple_loss=0.3059, pruned_loss=0.07344, over 21264.00 frames. ], tot_loss[loss=0.2228, simple_loss=0.2907, pruned_loss=0.07739, over 4273495.53 frames. 
], batch size: 176, lr: 4.96e-03, grad_scale: 16.0 2023-06-24 08:56:45,048 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1065558.0, ans=0.0 2023-06-24 08:56:45,113 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=1065558.0, ans=0.0 2023-06-24 08:56:47,396 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.44 vs. limit=12.0 2023-06-24 08:57:31,179 INFO [train.py:996] (3/4) Epoch 6, batch 25150, loss[loss=0.2148, simple_loss=0.2981, pruned_loss=0.06573, over 21355.00 frames. ], tot_loss[loss=0.2231, simple_loss=0.295, pruned_loss=0.07558, over 4278548.42 frames. ], batch size: 159, lr: 4.95e-03, grad_scale: 16.0 2023-06-24 08:57:47,526 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1065798.0, ans=0.1 2023-06-24 08:58:27,068 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.838e+02 2.368e+02 2.837e+02 3.510e+02 8.139e+02, threshold=5.674e+02, percent-clipped=4.0 2023-06-24 08:59:00,053 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=1065978.0, ans=0.125 2023-06-24 08:59:14,323 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-24 08:59:20,739 INFO [train.py:996] (3/4) Epoch 6, batch 25200, loss[loss=0.2103, simple_loss=0.3011, pruned_loss=0.05969, over 21723.00 frames. ], tot_loss[loss=0.2198, simple_loss=0.2939, pruned_loss=0.07283, over 4275465.94 frames. ], batch size: 298, lr: 4.95e-03, grad_scale: 32.0 2023-06-24 08:59:24,596 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=1066038.0, ans=0.0 2023-06-24 08:59:58,035 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1066098.0, ans=0.125 2023-06-24 09:00:31,236 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.95 vs. limit=12.0 2023-06-24 09:01:08,292 INFO [train.py:996] (3/4) Epoch 6, batch 25250, loss[loss=0.2134, simple_loss=0.2795, pruned_loss=0.07365, over 21744.00 frames. ], tot_loss[loss=0.218, simple_loss=0.2928, pruned_loss=0.07161, over 4281874.49 frames. ], batch size: 317, lr: 4.95e-03, grad_scale: 32.0 2023-06-24 09:01:38,521 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=1066398.0, ans=0.07 2023-06-24 09:01:51,037 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1066398.0, ans=0.1 2023-06-24 09:01:58,915 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.80 vs. limit=15.0 2023-06-24 09:02:12,212 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.941e+02 2.424e+02 2.718e+02 3.085e+02 4.421e+02, threshold=5.437e+02, percent-clipped=0.0 2023-06-24 09:02:58,769 INFO [train.py:996] (3/4) Epoch 6, batch 25300, loss[loss=0.223, simple_loss=0.3015, pruned_loss=0.07229, over 21434.00 frames. ], tot_loss[loss=0.2165, simple_loss=0.2901, pruned_loss=0.07143, over 4280507.68 frames. 
], batch size: 211, lr: 4.95e-03, grad_scale: 16.0 2023-06-24 09:03:22,131 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1066698.0, ans=0.125 2023-06-24 09:03:27,483 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1066698.0, ans=0.125 2023-06-24 09:03:45,112 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=1066758.0, ans=0.125 2023-06-24 09:04:08,592 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.35 vs. limit=15.0 2023-06-24 09:04:48,207 INFO [train.py:996] (3/4) Epoch 6, batch 25350, loss[loss=0.1836, simple_loss=0.2688, pruned_loss=0.04916, over 21382.00 frames. ], tot_loss[loss=0.2172, simple_loss=0.2921, pruned_loss=0.0711, over 4270207.05 frames. ], batch size: 211, lr: 4.95e-03, grad_scale: 16.0 2023-06-24 09:05:01,211 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1066938.0, ans=0.125 2023-06-24 09:05:50,927 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.891e+02 2.518e+02 2.873e+02 3.506e+02 6.244e+02, threshold=5.746e+02, percent-clipped=2.0 2023-06-24 09:06:28,992 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=1067178.0, ans=0.125 2023-06-24 09:06:35,387 INFO [train.py:996] (3/4) Epoch 6, batch 25400, loss[loss=0.2124, simple_loss=0.2872, pruned_loss=0.0688, over 21677.00 frames. ], tot_loss[loss=0.2133, simple_loss=0.2871, pruned_loss=0.06972, over 4259730.19 frames. ], batch size: 298, lr: 4.95e-03, grad_scale: 16.0 2023-06-24 09:06:54,772 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=1067238.0, ans=0.125 2023-06-24 09:07:32,045 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1067358.0, ans=0.0 2023-06-24 09:08:04,569 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=1067418.0, ans=0.95 2023-06-24 09:08:25,232 INFO [train.py:996] (3/4) Epoch 6, batch 25450, loss[loss=0.1799, simple_loss=0.2594, pruned_loss=0.05023, over 20709.00 frames. ], tot_loss[loss=0.214, simple_loss=0.2877, pruned_loss=0.07015, over 4253380.71 frames. ], batch size: 607, lr: 4.95e-03, grad_scale: 16.0 2023-06-24 09:09:10,573 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1067658.0, ans=0.125 2023-06-24 09:09:16,065 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1067658.0, ans=0.125 2023-06-24 09:09:30,510 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.916e+02 2.400e+02 2.613e+02 3.023e+02 4.754e+02, threshold=5.227e+02, percent-clipped=0.0 2023-06-24 09:10:00,785 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1067778.0, ans=0.125 2023-06-24 09:10:23,331 INFO [train.py:996] (3/4) Epoch 6, batch 25500, loss[loss=0.2156, simple_loss=0.3009, pruned_loss=0.06511, over 21732.00 frames. ], tot_loss[loss=0.2135, simple_loss=0.2901, pruned_loss=0.06847, over 4259884.78 frames. 
], batch size: 332, lr: 4.95e-03, grad_scale: 16.0 2023-06-24 09:10:47,869 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=1067898.0, ans=0.125 2023-06-24 09:11:24,121 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1067958.0, ans=0.0 2023-06-24 09:11:29,416 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1068018.0, ans=0.0 2023-06-24 09:11:29,547 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=1068018.0, ans=0.2 2023-06-24 09:12:14,518 INFO [train.py:996] (3/4) Epoch 6, batch 25550, loss[loss=0.2161, simple_loss=0.3133, pruned_loss=0.05941, over 21714.00 frames. ], tot_loss[loss=0.2159, simple_loss=0.2951, pruned_loss=0.0683, over 4248811.71 frames. ], batch size: 332, lr: 4.95e-03, grad_scale: 16.0 2023-06-24 09:12:15,025 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1068138.0, ans=0.125 2023-06-24 09:13:01,546 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.94 vs. limit=6.0 2023-06-24 09:13:20,092 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.858e+02 2.387e+02 2.706e+02 3.316e+02 5.632e+02, threshold=5.413e+02, percent-clipped=1.0 2023-06-24 09:14:05,900 INFO [train.py:996] (3/4) Epoch 6, batch 25600, loss[loss=0.2933, simple_loss=0.3556, pruned_loss=0.1155, over 21484.00 frames. ], tot_loss[loss=0.2199, simple_loss=0.3005, pruned_loss=0.0696, over 4253023.06 frames. ], batch size: 471, lr: 4.95e-03, grad_scale: 32.0 2023-06-24 09:14:18,346 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.25 vs. limit=6.0 2023-06-24 09:14:37,102 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.40 vs. limit=15.0 2023-06-24 09:14:47,995 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=12.13 vs. limit=15.0 2023-06-24 09:15:32,527 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1068618.0, ans=0.125 2023-06-24 09:15:32,556 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1068618.0, ans=0.0 2023-06-24 09:15:41,476 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1068678.0, ans=0.0 2023-06-24 09:16:00,288 INFO [train.py:996] (3/4) Epoch 6, batch 25650, loss[loss=0.2378, simple_loss=0.3274, pruned_loss=0.07409, over 19937.00 frames. ], tot_loss[loss=0.2226, simple_loss=0.3014, pruned_loss=0.07192, over 4252191.38 frames. ], batch size: 702, lr: 4.95e-03, grad_scale: 32.0 2023-06-24 09:16:31,493 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.02 vs. 
limit=15.0 2023-06-24 09:16:56,692 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.182e+02 2.676e+02 3.048e+02 3.761e+02 7.606e+02, threshold=6.096e+02, percent-clipped=4.0 2023-06-24 09:17:06,521 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.75 vs. limit=15.0 2023-06-24 09:17:13,429 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.08 vs. limit=15.0 2023-06-24 09:17:27,964 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=1068978.0, ans=0.0 2023-06-24 09:17:41,413 INFO [train.py:996] (3/4) Epoch 6, batch 25700, loss[loss=0.2328, simple_loss=0.3123, pruned_loss=0.07668, over 21883.00 frames. ], tot_loss[loss=0.2225, simple_loss=0.2983, pruned_loss=0.07337, over 4263703.80 frames. ], batch size: 316, lr: 4.95e-03, grad_scale: 32.0 2023-06-24 09:17:55,336 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=1069038.0, ans=0.125 2023-06-24 09:18:50,333 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.57 vs. limit=15.0 2023-06-24 09:18:53,416 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=1069218.0, ans=0.5 2023-06-24 09:19:26,275 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1069278.0, ans=0.0 2023-06-24 09:19:34,831 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=1069278.0, ans=0.125 2023-06-24 09:19:36,673 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=1069278.0, ans=0.09899494936611666 2023-06-24 09:19:39,957 INFO [train.py:996] (3/4) Epoch 6, batch 25750, loss[loss=0.3752, simple_loss=0.444, pruned_loss=0.1532, over 21460.00 frames. ], tot_loss[loss=0.2283, simple_loss=0.3035, pruned_loss=0.07652, over 4249271.74 frames. ], batch size: 508, lr: 4.95e-03, grad_scale: 32.0 2023-06-24 09:20:43,888 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.169e+02 2.640e+02 3.088e+02 3.573e+02 6.081e+02, threshold=6.175e+02, percent-clipped=0.0 2023-06-24 09:20:45,372 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=14.62 vs. limit=15.0 2023-06-24 09:21:06,921 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.97 vs. limit=10.0 2023-06-24 09:21:09,776 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1069518.0, ans=0.0 2023-06-24 09:21:24,703 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=1069578.0, ans=0.125 2023-06-24 09:21:25,213 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.33 vs. limit=15.0 2023-06-24 09:21:25,442 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=11.67 vs. 
limit=22.5 2023-06-24 09:21:42,017 INFO [train.py:996] (3/4) Epoch 6, batch 25800, loss[loss=0.2721, simple_loss=0.3486, pruned_loss=0.0978, over 21311.00 frames. ], tot_loss[loss=0.2382, simple_loss=0.3135, pruned_loss=0.08145, over 4250889.38 frames. ], batch size: 143, lr: 4.95e-03, grad_scale: 32.0 2023-06-24 09:22:17,624 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1069698.0, ans=0.0 2023-06-24 09:22:31,459 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1069758.0, ans=0.1 2023-06-24 09:22:31,573 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1069758.0, ans=0.125 2023-06-24 09:23:04,241 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=1069818.0, ans=0.2 2023-06-24 09:23:30,439 INFO [train.py:996] (3/4) Epoch 6, batch 25850, loss[loss=0.2615, simple_loss=0.3334, pruned_loss=0.09482, over 21839.00 frames. ], tot_loss[loss=0.2392, simple_loss=0.316, pruned_loss=0.08121, over 4252598.20 frames. ], batch size: 118, lr: 4.94e-03, grad_scale: 32.0 2023-06-24 09:23:37,256 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.65 vs. limit=10.0 2023-06-24 09:23:45,949 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.10 vs. limit=15.0 2023-06-24 09:23:52,788 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten.whitening_limit, batch_count=1069998.0, ans=22.5 2023-06-24 09:24:29,131 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.006e+02 2.685e+02 2.967e+02 3.484e+02 6.005e+02, threshold=5.935e+02, percent-clipped=0.0 2023-06-24 09:24:47,704 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1070118.0, ans=0.1 2023-06-24 09:25:16,411 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=1070178.0, ans=0.125 2023-06-24 09:25:21,152 INFO [train.py:996] (3/4) Epoch 6, batch 25900, loss[loss=0.2693, simple_loss=0.3563, pruned_loss=0.09117, over 21710.00 frames. ], tot_loss[loss=0.2401, simple_loss=0.3173, pruned_loss=0.08146, over 4260038.32 frames. ], batch size: 247, lr: 4.94e-03, grad_scale: 32.0 2023-06-24 09:25:58,195 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=10.68 vs. limit=15.0 2023-06-24 09:26:03,122 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1070298.0, ans=0.125 2023-06-24 09:27:06,144 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1070478.0, ans=0.125 2023-06-24 09:27:16,106 INFO [train.py:996] (3/4) Epoch 6, batch 25950, loss[loss=0.2337, simple_loss=0.3161, pruned_loss=0.07568, over 21847.00 frames. ], tot_loss[loss=0.2474, simple_loss=0.3249, pruned_loss=0.08489, over 4258584.48 frames. 
], batch size: 316, lr: 4.94e-03, grad_scale: 32.0 2023-06-24 09:27:20,458 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=1070538.0, ans=0.2 2023-06-24 09:27:27,617 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1070538.0, ans=0.125 2023-06-24 09:28:19,397 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-24 09:28:20,614 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.030e+02 2.613e+02 2.969e+02 3.394e+02 6.568e+02, threshold=5.938e+02, percent-clipped=2.0 2023-06-24 09:28:30,061 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=1070718.0, ans=0.0 2023-06-24 09:29:03,522 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1070778.0, ans=0.0 2023-06-24 09:29:05,497 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=1070838.0, ans=0.0 2023-06-24 09:29:06,457 INFO [train.py:996] (3/4) Epoch 6, batch 26000, loss[loss=0.2531, simple_loss=0.3289, pruned_loss=0.0887, over 21293.00 frames. ], tot_loss[loss=0.2443, simple_loss=0.3238, pruned_loss=0.08237, over 4256287.87 frames. ], batch size: 549, lr: 4.94e-03, grad_scale: 32.0 2023-06-24 09:30:30,248 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=1071018.0, ans=0.0 2023-06-24 09:30:49,543 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.95 vs. limit=15.0 2023-06-24 09:31:00,839 INFO [train.py:996] (3/4) Epoch 6, batch 26050, loss[loss=0.2207, simple_loss=0.2875, pruned_loss=0.07698, over 21920.00 frames. ], tot_loss[loss=0.2464, simple_loss=0.3238, pruned_loss=0.08445, over 4264260.92 frames. ], batch size: 351, lr: 4.94e-03, grad_scale: 16.0 2023-06-24 09:31:55,851 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1071258.0, ans=0.0 2023-06-24 09:31:55,913 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1071258.0, ans=0.125 2023-06-24 09:31:58,908 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.920e+02 2.591e+02 3.026e+02 3.549e+02 5.342e+02, threshold=6.052e+02, percent-clipped=0.0 2023-06-24 09:32:05,557 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.49 vs. limit=22.5 2023-06-24 09:32:09,799 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=1071318.0, ans=0.2 2023-06-24 09:32:47,960 INFO [train.py:996] (3/4) Epoch 6, batch 26100, loss[loss=0.2357, simple_loss=0.3013, pruned_loss=0.08509, over 21916.00 frames. ], tot_loss[loss=0.2426, simple_loss=0.3179, pruned_loss=0.08366, over 4269404.59 frames. 
], batch size: 371, lr: 4.94e-03, grad_scale: 16.0 2023-06-24 09:32:52,372 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1071438.0, ans=0.125 2023-06-24 09:32:59,282 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1071438.0, ans=0.125 2023-06-24 09:33:18,651 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=1071498.0, ans=0.0 2023-06-24 09:34:13,473 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=1071678.0, ans=0.04949747468305833 2023-06-24 09:34:33,914 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=1071678.0, ans=0.2 2023-06-24 09:34:38,490 INFO [train.py:996] (3/4) Epoch 6, batch 26150, loss[loss=0.2327, simple_loss=0.3035, pruned_loss=0.08095, over 21746.00 frames. ], tot_loss[loss=0.2413, simple_loss=0.3152, pruned_loss=0.08364, over 4281965.79 frames. ], batch size: 351, lr: 4.94e-03, grad_scale: 16.0 2023-06-24 09:35:11,978 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.00 vs. limit=15.0 2023-06-24 09:35:27,305 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.min_positive, batch_count=1071858.0, ans=0.05 2023-06-24 09:35:39,631 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.029e+02 2.601e+02 2.864e+02 3.408e+02 4.627e+02, threshold=5.727e+02, percent-clipped=0.0 2023-06-24 09:36:15,412 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=1071978.0, ans=0.125 2023-06-24 09:36:20,607 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1071978.0, ans=0.125 2023-06-24 09:36:28,867 INFO [train.py:996] (3/4) Epoch 6, batch 26200, loss[loss=0.2555, simple_loss=0.3604, pruned_loss=0.07528, over 21697.00 frames. ], tot_loss[loss=0.2396, simple_loss=0.3165, pruned_loss=0.08131, over 4282025.78 frames. ], batch size: 414, lr: 4.94e-03, grad_scale: 16.0 2023-06-24 09:37:25,454 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=1072158.0, ans=0.0 2023-06-24 09:38:22,425 INFO [train.py:996] (3/4) Epoch 6, batch 26250, loss[loss=0.2194, simple_loss=0.296, pruned_loss=0.07145, over 21831.00 frames. ], tot_loss[loss=0.2408, simple_loss=0.3197, pruned_loss=0.08096, over 4283975.63 frames. ], batch size: 282, lr: 4.94e-03, grad_scale: 16.0 2023-06-24 09:39:20,859 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.139e+02 2.533e+02 2.809e+02 3.331e+02 4.740e+02, threshold=5.619e+02, percent-clipped=0.0 2023-06-24 09:39:54,249 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.89 vs. limit=15.0 2023-06-24 09:40:16,010 INFO [train.py:996] (3/4) Epoch 6, batch 26300, loss[loss=0.2314, simple_loss=0.2917, pruned_loss=0.08555, over 21898.00 frames. ], tot_loss[loss=0.2387, simple_loss=0.3155, pruned_loss=0.08096, over 4285106.50 frames. 
], batch size: 414, lr: 4.94e-03, grad_scale: 16.0 2023-06-24 09:40:18,517 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=1072638.0, ans=0.2 2023-06-24 09:40:32,636 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1072698.0, ans=0.125 2023-06-24 09:41:27,621 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1072818.0, ans=0.0 2023-06-24 09:42:05,732 INFO [train.py:996] (3/4) Epoch 6, batch 26350, loss[loss=0.2418, simple_loss=0.3095, pruned_loss=0.08707, over 21348.00 frames. ], tot_loss[loss=0.2384, simple_loss=0.3143, pruned_loss=0.08125, over 4295131.58 frames. ], batch size: 548, lr: 4.94e-03, grad_scale: 16.0 2023-06-24 09:42:35,429 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.31 vs. limit=12.0 2023-06-24 09:42:58,172 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.100e+02 2.852e+02 3.248e+02 3.843e+02 6.054e+02, threshold=6.496e+02, percent-clipped=2.0 2023-06-24 09:43:09,682 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=1073118.0, ans=0.025 2023-06-24 09:43:53,726 INFO [train.py:996] (3/4) Epoch 6, batch 26400, loss[loss=0.215, simple_loss=0.2716, pruned_loss=0.07921, over 21306.00 frames. ], tot_loss[loss=0.2345, simple_loss=0.3073, pruned_loss=0.08085, over 4280995.47 frames. ], batch size: 176, lr: 4.94e-03, grad_scale: 32.0 2023-06-24 09:43:54,374 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1073238.0, ans=0.1 2023-06-24 09:44:30,714 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.76 vs. limit=15.0 2023-06-24 09:45:42,443 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1073478.0, ans=0.125 2023-06-24 09:45:50,252 INFO [train.py:996] (3/4) Epoch 6, batch 26450, loss[loss=0.3274, simple_loss=0.4154, pruned_loss=0.1197, over 21519.00 frames. ], tot_loss[loss=0.2364, simple_loss=0.3104, pruned_loss=0.08122, over 4277021.62 frames. ], batch size: 471, lr: 4.94e-03, grad_scale: 32.0 2023-06-24 09:46:12,758 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=1073598.0, ans=0.2 2023-06-24 09:46:50,151 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.007e+02 2.812e+02 3.126e+02 4.062e+02 8.206e+02, threshold=6.252e+02, percent-clipped=4.0 2023-06-24 09:47:01,957 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1073718.0, ans=0.125 2023-06-24 09:47:16,227 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1073778.0, ans=0.125 2023-06-24 09:47:32,586 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=14.06 vs. limit=15.0 2023-06-24 09:47:39,828 INFO [train.py:996] (3/4) Epoch 6, batch 26500, loss[loss=0.1902, simple_loss=0.2531, pruned_loss=0.06369, over 21260.00 frames. 
], tot_loss[loss=0.2353, simple_loss=0.3103, pruned_loss=0.08013, over 4270354.60 frames. ], batch size: 159, lr: 4.94e-03, grad_scale: 32.0 2023-06-24 09:48:17,175 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=12.19 vs. limit=15.0 2023-06-24 09:49:03,022 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1074018.0, ans=0.1 2023-06-24 09:49:21,667 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=1074078.0, ans=0.2 2023-06-24 09:49:30,870 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=1074138.0, ans=0.0 2023-06-24 09:49:31,870 INFO [train.py:996] (3/4) Epoch 6, batch 26550, loss[loss=0.2564, simple_loss=0.3172, pruned_loss=0.09775, over 20009.00 frames. ], tot_loss[loss=0.2314, simple_loss=0.3089, pruned_loss=0.0769, over 4265660.42 frames. ], batch size: 702, lr: 4.94e-03, grad_scale: 32.0 2023-06-24 09:50:08,089 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1074198.0, ans=0.125 2023-06-24 09:50:42,536 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.986e+02 2.610e+02 3.106e+02 3.674e+02 5.828e+02, threshold=6.212e+02, percent-clipped=0.0 2023-06-24 09:51:11,179 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1074378.0, ans=0.125 2023-06-24 09:51:26,528 INFO [train.py:996] (3/4) Epoch 6, batch 26600, loss[loss=0.2076, simple_loss=0.284, pruned_loss=0.06556, over 21496.00 frames. ], tot_loss[loss=0.2271, simple_loss=0.3061, pruned_loss=0.07401, over 4258176.47 frames. ], batch size: 389, lr: 4.93e-03, grad_scale: 32.0 2023-06-24 09:51:55,877 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1074498.0, ans=0.1 2023-06-24 09:52:27,834 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=14.09 vs. limit=22.5 2023-06-24 09:52:52,849 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1074678.0, ans=0.0 2023-06-24 09:53:15,488 INFO [train.py:996] (3/4) Epoch 6, batch 26650, loss[loss=0.1991, simple_loss=0.2405, pruned_loss=0.07886, over 20071.00 frames. ], tot_loss[loss=0.2229, simple_loss=0.2994, pruned_loss=0.07325, over 4256583.67 frames. 
], batch size: 704, lr: 4.93e-03, grad_scale: 32.0 2023-06-24 09:53:38,301 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1074738.0, ans=0.1 2023-06-24 09:53:52,074 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-24 09:54:14,191 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1074858.0, ans=0.125 2023-06-24 09:54:18,557 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.725e+02 2.261e+02 2.468e+02 2.751e+02 5.054e+02, threshold=4.936e+02, percent-clipped=0.0 2023-06-24 09:54:25,108 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.19 vs. limit=12.0 2023-06-24 09:54:27,749 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1074918.0, ans=0.0 2023-06-24 09:55:03,184 INFO [train.py:996] (3/4) Epoch 6, batch 26700, loss[loss=0.2372, simple_loss=0.3039, pruned_loss=0.08525, over 21790.00 frames. ], tot_loss[loss=0.2151, simple_loss=0.2911, pruned_loss=0.06956, over 4254424.94 frames. ], batch size: 441, lr: 4.93e-03, grad_scale: 32.0 2023-06-24 09:55:16,377 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1075038.0, ans=0.125 2023-06-24 09:55:46,040 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=1075098.0, ans=0.0 2023-06-24 09:55:46,100 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1075098.0, ans=0.1 2023-06-24 09:56:33,047 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=1075278.0, ans=0.2 2023-06-24 09:56:59,419 INFO [train.py:996] (3/4) Epoch 6, batch 26750, loss[loss=0.2376, simple_loss=0.308, pruned_loss=0.08356, over 21896.00 frames. ], tot_loss[loss=0.2141, simple_loss=0.291, pruned_loss=0.06855, over 4266300.23 frames. ], batch size: 351, lr: 4.93e-03, grad_scale: 16.0 2023-06-24 09:57:49,177 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1075458.0, ans=0.125 2023-06-24 09:57:55,227 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.775e+02 2.365e+02 2.700e+02 3.222e+02 4.591e+02, threshold=5.400e+02, percent-clipped=0.0 2023-06-24 09:57:55,972 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=1075518.0, ans=10.0 2023-06-24 09:58:08,833 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1075518.0, ans=0.125 2023-06-24 09:58:28,493 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1075518.0, ans=0.0 2023-06-24 09:58:44,846 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1075578.0, ans=0.125 2023-06-24 09:58:49,589 INFO [train.py:996] (3/4) Epoch 6, batch 26800, loss[loss=0.2357, simple_loss=0.309, pruned_loss=0.08118, over 21974.00 frames. 
], tot_loss[loss=0.22, simple_loss=0.297, pruned_loss=0.07146, over 4270476.52 frames. ], batch size: 317, lr: 4.93e-03, grad_scale: 32.0 2023-06-24 09:59:37,888 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=1075758.0, ans=0.2 2023-06-24 10:00:43,885 INFO [train.py:996] (3/4) Epoch 6, batch 26850, loss[loss=0.1974, simple_loss=0.2671, pruned_loss=0.06386, over 21825.00 frames. ], tot_loss[loss=0.2241, simple_loss=0.2996, pruned_loss=0.0743, over 4268015.38 frames. ], batch size: 98, lr: 4.93e-03, grad_scale: 16.0 2023-06-24 10:01:33,745 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1076058.0, ans=0.1 2023-06-24 10:01:46,039 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.916e+02 2.745e+02 3.127e+02 3.693e+02 5.292e+02, threshold=6.255e+02, percent-clipped=0.0 2023-06-24 10:02:19,323 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1076178.0, ans=0.125 2023-06-24 10:02:25,577 INFO [train.py:996] (3/4) Epoch 6, batch 26900, loss[loss=0.1823, simple_loss=0.2426, pruned_loss=0.06102, over 21604.00 frames. ], tot_loss[loss=0.2201, simple_loss=0.2925, pruned_loss=0.07388, over 4262226.94 frames. ], batch size: 231, lr: 4.93e-03, grad_scale: 16.0 2023-06-24 10:02:53,219 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=1076298.0, ans=0.0 2023-06-24 10:02:53,268 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=1076298.0, ans=0.2 2023-06-24 10:03:25,627 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1076358.0, ans=0.125 2023-06-24 10:03:26,236 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.89 vs. limit=10.0 2023-06-24 10:04:14,907 INFO [train.py:996] (3/4) Epoch 6, batch 26950, loss[loss=0.1739, simple_loss=0.2238, pruned_loss=0.06204, over 20780.00 frames. ], tot_loss[loss=0.2194, simple_loss=0.2908, pruned_loss=0.07399, over 4267616.21 frames. ], batch size: 609, lr: 4.93e-03, grad_scale: 8.0 2023-06-24 10:04:40,172 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=1076598.0, ans=0.125 2023-06-24 10:04:40,274 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=1076598.0, ans=0.125 2023-06-24 10:05:12,433 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=1076658.0, ans=0.125 2023-06-24 10:05:26,437 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.965e+02 2.499e+02 2.950e+02 4.079e+02 6.623e+02, threshold=5.900e+02, percent-clipped=3.0 2023-06-24 10:05:48,158 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1076778.0, ans=0.1 2023-06-24 10:06:02,712 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=5.59 vs. limit=12.0 2023-06-24 10:06:10,646 INFO [train.py:996] (3/4) Epoch 6, batch 27000, loss[loss=0.1931, simple_loss=0.2886, pruned_loss=0.04884, over 21750.00 frames. 
], tot_loss[loss=0.2177, simple_loss=0.291, pruned_loss=0.07218, over 4263624.47 frames. ], batch size: 316, lr: 4.93e-03, grad_scale: 8.0 2023-06-24 10:06:10,647 INFO [train.py:1019] (3/4) Computing validation loss 2023-06-24 10:06:28,767 INFO [train.py:1028] (3/4) Epoch 6, validation: loss=0.2519, simple_loss=0.3439, pruned_loss=0.0799, over 1796401.00 frames. 2023-06-24 10:06:28,768 INFO [train.py:1029] (3/4) Maximum memory allocated so far is 23366MB 2023-06-24 10:06:36,152 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=1076838.0, ans=0.0 2023-06-24 10:06:52,136 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.40 vs. limit=15.0 2023-06-24 10:07:57,902 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1077018.0, ans=0.1 2023-06-24 10:08:18,373 INFO [train.py:996] (3/4) Epoch 6, batch 27050, loss[loss=0.1877, simple_loss=0.2813, pruned_loss=0.04704, over 21685.00 frames. ], tot_loss[loss=0.2162, simple_loss=0.2943, pruned_loss=0.06909, over 4265919.32 frames. ], batch size: 247, lr: 4.93e-03, grad_scale: 8.0 2023-06-24 10:08:24,298 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1077138.0, ans=0.125 2023-06-24 10:08:57,045 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=1077198.0, ans=0.125 2023-06-24 10:09:12,259 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=1077258.0, ans=0.125 2023-06-24 10:09:17,442 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=1077258.0, ans=0.04949747468305833 2023-06-24 10:09:34,301 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.742e+02 2.401e+02 2.781e+02 3.239e+02 4.464e+02, threshold=5.563e+02, percent-clipped=0.0 2023-06-24 10:10:03,686 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1077378.0, ans=0.1 2023-06-24 10:10:08,113 INFO [train.py:996] (3/4) Epoch 6, batch 27100, loss[loss=0.2008, simple_loss=0.2906, pruned_loss=0.05547, over 21805.00 frames. ], tot_loss[loss=0.2196, simple_loss=0.297, pruned_loss=0.07114, over 4280393.80 frames. ], batch size: 247, lr: 4.93e-03, grad_scale: 8.0 2023-06-24 10:11:00,308 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1077558.0, ans=0.1 2023-06-24 10:11:37,515 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=1077618.0, ans=0.1 2023-06-24 10:11:43,422 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=1077678.0, ans=0.0 2023-06-24 10:11:43,449 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=1077678.0, ans=0.0 2023-06-24 10:11:58,242 INFO [train.py:996] (3/4) Epoch 6, batch 27150, loss[loss=0.2272, simple_loss=0.3198, pruned_loss=0.06725, over 21443.00 frames. ], tot_loss[loss=0.2286, simple_loss=0.3084, pruned_loss=0.07442, over 4287080.75 frames. 
], batch size: 211, lr: 4.93e-03, grad_scale: 8.0 2023-06-24 10:13:13,427 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.037e+02 2.603e+02 2.899e+02 3.318e+02 5.343e+02, threshold=5.797e+02, percent-clipped=0.0 2023-06-24 10:13:52,972 INFO [train.py:996] (3/4) Epoch 6, batch 27200, loss[loss=0.3423, simple_loss=0.3891, pruned_loss=0.1477, over 21392.00 frames. ], tot_loss[loss=0.2355, simple_loss=0.3166, pruned_loss=0.07719, over 4286863.68 frames. ], batch size: 508, lr: 4.93e-03, grad_scale: 16.0 2023-06-24 10:14:14,951 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=1078098.0, ans=0.04949747468305833 2023-06-24 10:14:18,378 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=1078098.0, ans=0.2 2023-06-24 10:14:53,592 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=1078158.0, ans=0.2 2023-06-24 10:15:24,446 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1078278.0, ans=0.0 2023-06-24 10:15:48,389 INFO [train.py:996] (3/4) Epoch 6, batch 27250, loss[loss=0.251, simple_loss=0.318, pruned_loss=0.09205, over 21373.00 frames. ], tot_loss[loss=0.2415, simple_loss=0.32, pruned_loss=0.08153, over 4287817.12 frames. ], batch size: 176, lr: 4.93e-03, grad_scale: 16.0 2023-06-24 10:15:48,923 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1078338.0, ans=0.0 2023-06-24 10:15:52,610 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=1078338.0, ans=0.07 2023-06-24 10:16:25,891 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.33 vs. limit=6.0 2023-06-24 10:16:36,666 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1078458.0, ans=0.125 2023-06-24 10:16:54,866 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1078518.0, ans=0.1 2023-06-24 10:16:56,090 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.396e+02 2.982e+02 3.326e+02 3.737e+02 5.172e+02, threshold=6.652e+02, percent-clipped=0.0 2023-06-24 10:17:35,903 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=1078578.0, ans=0.125 2023-06-24 10:17:36,518 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.82 vs. limit=15.0 2023-06-24 10:17:45,585 INFO [train.py:996] (3/4) Epoch 6, batch 27300, loss[loss=0.2545, simple_loss=0.341, pruned_loss=0.08404, over 21913.00 frames. ], tot_loss[loss=0.2432, simple_loss=0.3217, pruned_loss=0.0824, over 4286609.69 frames. ], batch size: 372, lr: 4.92e-03, grad_scale: 16.0 2023-06-24 10:17:48,284 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-24 10:19:07,474 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.11 vs. 
limit=12.0 2023-06-24 10:19:33,549 INFO [train.py:996] (3/4) Epoch 6, batch 27350, loss[loss=0.2197, simple_loss=0.3125, pruned_loss=0.06346, over 21780.00 frames. ], tot_loss[loss=0.2463, simple_loss=0.3249, pruned_loss=0.08391, over 4285247.38 frames. ], batch size: 332, lr: 4.92e-03, grad_scale: 16.0 2023-06-24 10:19:34,053 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=1078938.0, ans=0.0 2023-06-24 10:20:37,090 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.169e+02 2.617e+02 2.947e+02 3.408e+02 6.075e+02, threshold=5.893e+02, percent-clipped=0.0 2023-06-24 10:21:16,373 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=1079178.0, ans=0.07 2023-06-24 10:21:19,464 INFO [train.py:996] (3/4) Epoch 6, batch 27400, loss[loss=0.204, simple_loss=0.2703, pruned_loss=0.06891, over 21659.00 frames. ], tot_loss[loss=0.2414, simple_loss=0.3185, pruned_loss=0.08219, over 4286076.34 frames. ], batch size: 247, lr: 4.92e-03, grad_scale: 16.0 2023-06-24 10:22:48,745 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1079478.0, ans=0.1 2023-06-24 10:23:07,158 INFO [train.py:996] (3/4) Epoch 6, batch 27450, loss[loss=0.2149, simple_loss=0.3005, pruned_loss=0.06465, over 21600.00 frames. ], tot_loss[loss=0.2367, simple_loss=0.3128, pruned_loss=0.0803, over 4289797.50 frames. ], batch size: 263, lr: 4.92e-03, grad_scale: 16.0 2023-06-24 10:23:18,062 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=1079538.0, ans=0.0 2023-06-24 10:23:28,368 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=1079598.0, ans=0.125 2023-06-24 10:24:07,320 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.980e+02 2.466e+02 2.775e+02 3.164e+02 4.697e+02, threshold=5.550e+02, percent-clipped=0.0 2023-06-24 10:24:49,319 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=1079838.0, ans=0.0 2023-06-24 10:24:50,395 INFO [train.py:996] (3/4) Epoch 6, batch 27500, loss[loss=0.2316, simple_loss=0.2974, pruned_loss=0.08294, over 21503.00 frames. ], tot_loss[loss=0.236, simple_loss=0.3109, pruned_loss=0.08054, over 4294654.59 frames. ], batch size: 548, lr: 4.92e-03, grad_scale: 16.0 2023-06-24 10:24:54,355 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=1079838.0, ans=0.125 2023-06-24 10:24:56,237 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=1079838.0, ans=0.125 2023-06-24 10:25:48,844 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1080018.0, ans=0.0 2023-06-24 10:25:49,444 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=12.67 vs. 
limit=15.0 2023-06-24 10:26:26,485 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1080078.0, ans=0.125 2023-06-24 10:26:33,110 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=1080138.0, ans=0.04949747468305833 2023-06-24 10:26:34,073 INFO [train.py:996] (3/4) Epoch 6, batch 27550, loss[loss=0.2158, simple_loss=0.2754, pruned_loss=0.07806, over 21502.00 frames. ], tot_loss[loss=0.2304, simple_loss=0.3062, pruned_loss=0.07732, over 4285659.88 frames. ], batch size: 441, lr: 4.92e-03, grad_scale: 16.0 2023-06-24 10:27:20,452 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=1080258.0, ans=0.125 2023-06-24 10:27:27,251 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1080258.0, ans=0.125 2023-06-24 10:27:29,798 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.49 vs. limit=15.0 2023-06-24 10:27:39,185 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=1080258.0, ans=0.2 2023-06-24 10:27:43,819 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.907e+02 2.511e+02 2.711e+02 3.223e+02 7.892e+02, threshold=5.422e+02, percent-clipped=3.0 2023-06-24 10:28:15,030 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1080378.0, ans=0.125 2023-06-24 10:28:21,563 INFO [train.py:996] (3/4) Epoch 6, batch 27600, loss[loss=0.1945, simple_loss=0.2626, pruned_loss=0.06324, over 21609.00 frames. ], tot_loss[loss=0.2257, simple_loss=0.2993, pruned_loss=0.07607, over 4279771.92 frames. ], batch size: 263, lr: 4.92e-03, grad_scale: 32.0 2023-06-24 10:28:51,139 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1080498.0, ans=0.125 2023-06-24 10:29:35,192 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=1080618.0, ans=0.025 2023-06-24 10:29:55,056 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1080678.0, ans=0.125 2023-06-24 10:30:08,102 INFO [train.py:996] (3/4) Epoch 6, batch 27650, loss[loss=0.1936, simple_loss=0.2514, pruned_loss=0.06791, over 21406.00 frames. ], tot_loss[loss=0.2218, simple_loss=0.2933, pruned_loss=0.0751, over 4274676.11 frames. ], batch size: 160, lr: 4.92e-03, grad_scale: 32.0 2023-06-24 10:31:12,207 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.960e+02 2.435e+02 2.709e+02 3.081e+02 4.195e+02, threshold=5.419e+02, percent-clipped=0.0 2023-06-24 10:31:26,727 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1080918.0, ans=0.125 2023-06-24 10:31:55,344 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1081038.0, ans=0.125 2023-06-24 10:31:56,490 INFO [train.py:996] (3/4) Epoch 6, batch 27700, loss[loss=0.2487, simple_loss=0.3321, pruned_loss=0.08265, over 21713.00 frames. 
], tot_loss[loss=0.2206, simple_loss=0.293, pruned_loss=0.07406, over 4280450.24 frames. ], batch size: 298, lr: 4.92e-03, grad_scale: 32.0 2023-06-24 10:32:08,167 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1081038.0, ans=0.125 2023-06-24 10:33:30,629 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1081278.0, ans=0.125 2023-06-24 10:33:45,289 INFO [train.py:996] (3/4) Epoch 6, batch 27750, loss[loss=0.212, simple_loss=0.2956, pruned_loss=0.06417, over 21431.00 frames. ], tot_loss[loss=0.221, simple_loss=0.2957, pruned_loss=0.07317, over 4280415.43 frames. ], batch size: 211, lr: 4.92e-03, grad_scale: 32.0 2023-06-24 10:34:01,149 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1081398.0, ans=0.125 2023-06-24 10:34:08,507 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=8.15 vs. limit=15.0 2023-06-24 10:34:24,048 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=1081458.0, ans=0.0 2023-06-24 10:34:31,173 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=1081458.0, ans=0.0 2023-06-24 10:34:38,562 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.74 vs. limit=10.0 2023-06-24 10:34:40,045 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=1081458.0, ans=0.125 2023-06-24 10:34:55,017 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.871e+02 2.574e+02 2.914e+02 3.859e+02 6.202e+02, threshold=5.827e+02, percent-clipped=2.0 2023-06-24 10:35:17,933 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1081578.0, ans=0.1 2023-06-24 10:35:31,931 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=1081638.0, ans=0.125 2023-06-24 10:35:32,876 INFO [train.py:996] (3/4) Epoch 6, batch 27800, loss[loss=0.226, simple_loss=0.2843, pruned_loss=0.08385, over 20040.00 frames. ], tot_loss[loss=0.2209, simple_loss=0.2953, pruned_loss=0.07321, over 4283590.52 frames. ], batch size: 703, lr: 4.92e-03, grad_scale: 32.0 2023-06-24 10:36:31,183 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=1081758.0, ans=0.0 2023-06-24 10:36:45,253 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1081818.0, ans=0.125 2023-06-24 10:37:13,737 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=1081878.0, ans=0.2 2023-06-24 10:37:21,726 INFO [train.py:996] (3/4) Epoch 6, batch 27850, loss[loss=0.2482, simple_loss=0.3135, pruned_loss=0.09148, over 21808.00 frames. ], tot_loss[loss=0.2237, simple_loss=0.2967, pruned_loss=0.07539, over 4291811.00 frames. ], batch size: 441, lr: 4.92e-03, grad_scale: 16.0 2023-06-24 10:38:20,295 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.32 vs. 
limit=12.0 2023-06-24 10:38:24,398 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=1082058.0, ans=0.1 2023-06-24 10:38:39,968 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.988e+02 2.601e+02 3.026e+02 3.751e+02 1.054e+03, threshold=6.053e+02, percent-clipped=6.0 2023-06-24 10:39:11,465 INFO [train.py:996] (3/4) Epoch 6, batch 27900, loss[loss=0.2427, simple_loss=0.3248, pruned_loss=0.08027, over 20042.00 frames. ], tot_loss[loss=0.2308, simple_loss=0.3068, pruned_loss=0.07743, over 4287492.03 frames. ], batch size: 703, lr: 4.92e-03, grad_scale: 16.0 2023-06-24 10:40:19,205 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=1082358.0, ans=0.04949747468305833 2023-06-24 10:40:46,725 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1082478.0, ans=0.1 2023-06-24 10:41:01,365 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.05 vs. limit=15.0 2023-06-24 10:41:13,499 INFO [train.py:996] (3/4) Epoch 6, batch 27950, loss[loss=0.1773, simple_loss=0.2606, pruned_loss=0.04694, over 21411.00 frames. ], tot_loss[loss=0.2263, simple_loss=0.3045, pruned_loss=0.07408, over 4275432.63 frames. ], batch size: 211, lr: 4.92e-03, grad_scale: 16.0 2023-06-24 10:42:04,592 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1082658.0, ans=0.125 2023-06-24 10:42:19,428 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.909e+02 2.526e+02 3.218e+02 4.121e+02 6.447e+02, threshold=6.437e+02, percent-clipped=1.0 2023-06-24 10:43:01,513 INFO [train.py:996] (3/4) Epoch 6, batch 28000, loss[loss=0.222, simple_loss=0.2872, pruned_loss=0.07841, over 21433.00 frames. ], tot_loss[loss=0.2238, simple_loss=0.3025, pruned_loss=0.07257, over 4275808.48 frames. ], batch size: 144, lr: 4.92e-03, grad_scale: 32.0 2023-06-24 10:43:16,170 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1082838.0, ans=0.125 2023-06-24 10:43:17,631 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-24 10:43:38,037 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.24 vs. limit=15.0 2023-06-24 10:44:03,622 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1083018.0, ans=0.1 2023-06-24 10:44:57,542 INFO [train.py:996] (3/4) Epoch 6, batch 28050, loss[loss=0.1923, simple_loss=0.2568, pruned_loss=0.06392, over 21818.00 frames. ], tot_loss[loss=0.2218, simple_loss=0.2987, pruned_loss=0.07242, over 4273731.98 frames. ], batch size: 118, lr: 4.91e-03, grad_scale: 32.0 2023-06-24 10:45:06,888 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=1083138.0, ans=0.125 2023-06-24 10:45:52,141 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1083258.0, ans=0.1 2023-06-24 10:45:56,332 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.13 vs. 
limit=12.0 2023-06-24 10:45:57,772 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1083318.0, ans=0.125 2023-06-24 10:46:04,459 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.090e+02 2.746e+02 3.083e+02 3.764e+02 7.718e+02, threshold=6.165e+02, percent-clipped=1.0 2023-06-24 10:46:42,331 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.74 vs. limit=22.5 2023-06-24 10:46:45,931 INFO [train.py:996] (3/4) Epoch 6, batch 28100, loss[loss=0.2021, simple_loss=0.2652, pruned_loss=0.06951, over 21728.00 frames. ], tot_loss[loss=0.2222, simple_loss=0.2988, pruned_loss=0.07284, over 4275096.08 frames. ], batch size: 371, lr: 4.91e-03, grad_scale: 32.0 2023-06-24 10:47:16,058 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1083498.0, ans=0.125 2023-06-24 10:47:39,427 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1083558.0, ans=0.1 2023-06-24 10:47:58,288 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=5.94 vs. limit=10.0 2023-06-24 10:48:34,045 INFO [train.py:996] (3/4) Epoch 6, batch 28150, loss[loss=0.2103, simple_loss=0.2723, pruned_loss=0.07415, over 21575.00 frames. ], tot_loss[loss=0.2209, simple_loss=0.2969, pruned_loss=0.07241, over 4262918.69 frames. ], batch size: 415, lr: 4.91e-03, grad_scale: 32.0 2023-06-24 10:48:37,789 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=1083738.0, ans=0.125 2023-06-24 10:48:46,740 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=1083738.0, ans=0.125 2023-06-24 10:48:58,925 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=1083798.0, ans=0.2 2023-06-24 10:49:25,120 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.54 vs. limit=15.0 2023-06-24 10:49:29,864 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1083918.0, ans=0.1 2023-06-24 10:49:40,078 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.220e+02 2.816e+02 3.227e+02 4.008e+02 8.112e+02, threshold=6.453e+02, percent-clipped=1.0 2023-06-24 10:50:24,156 INFO [train.py:996] (3/4) Epoch 6, batch 28200, loss[loss=0.1991, simple_loss=0.2593, pruned_loss=0.06947, over 21124.00 frames. ], tot_loss[loss=0.221, simple_loss=0.2943, pruned_loss=0.07386, over 4271292.67 frames. ], batch size: 176, lr: 4.91e-03, grad_scale: 32.0 2023-06-24 10:50:26,781 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=1084038.0, ans=0.0 2023-06-24 10:50:30,081 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=1084038.0, ans=0.125 2023-06-24 10:50:34,449 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.60 vs. 
limit=15.0 2023-06-24 10:51:51,411 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=1084218.0, ans=0.0 2023-06-24 10:52:04,190 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1084278.0, ans=0.125 2023-06-24 10:52:11,996 INFO [train.py:996] (3/4) Epoch 6, batch 28250, loss[loss=0.2926, simple_loss=0.3304, pruned_loss=0.1274, over 21431.00 frames. ], tot_loss[loss=0.2242, simple_loss=0.2961, pruned_loss=0.07615, over 4264534.00 frames. ], batch size: 510, lr: 4.91e-03, grad_scale: 32.0 2023-06-24 10:52:39,008 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.60 vs. limit=10.0 2023-06-24 10:52:42,173 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1084398.0, ans=0.125 2023-06-24 10:53:02,206 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.min_positive, batch_count=1084458.0, ans=0.05 2023-06-24 10:53:22,108 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.07 vs. limit=6.0 2023-06-24 10:53:30,913 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.085e+02 2.671e+02 3.008e+02 3.478e+02 6.433e+02, threshold=6.015e+02, percent-clipped=0.0 2023-06-24 10:53:55,457 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=1084578.0, ans=0.0 2023-06-24 10:53:57,114 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1084578.0, ans=0.1 2023-06-24 10:54:03,545 INFO [train.py:996] (3/4) Epoch 6, batch 28300, loss[loss=0.2065, simple_loss=0.2792, pruned_loss=0.06688, over 21382.00 frames. ], tot_loss[loss=0.2205, simple_loss=0.2941, pruned_loss=0.07351, over 4263244.44 frames. ], batch size: 160, lr: 4.91e-03, grad_scale: 32.0 2023-06-24 10:54:19,650 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=1084638.0, ans=0.125 2023-06-24 10:54:42,074 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=13.36 vs. limit=15.0 2023-06-24 10:55:28,127 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=1084818.0, ans=0.0 2023-06-24 10:55:46,090 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=1084878.0, ans=0.125 2023-06-24 10:55:53,157 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.76 vs. limit=15.0 2023-06-24 10:55:53,161 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.75 vs. limit=15.0 2023-06-24 10:55:57,132 INFO [train.py:996] (3/4) Epoch 6, batch 28350, loss[loss=0.2056, simple_loss=0.2645, pruned_loss=0.07329, over 22004.00 frames. ], tot_loss[loss=0.2119, simple_loss=0.2876, pruned_loss=0.06809, over 4266679.51 frames. 
], batch size: 103, lr: 4.91e-03, grad_scale: 32.0 2023-06-24 10:56:14,365 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1084938.0, ans=0.0 2023-06-24 10:56:25,425 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=10.26 vs. limit=22.5 2023-06-24 10:56:30,692 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=14.35 vs. limit=22.5 2023-06-24 10:56:45,244 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=15.18 vs. limit=22.5 2023-06-24 10:57:10,497 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.831e+02 2.270e+02 2.582e+02 2.935e+02 5.064e+02, threshold=5.164e+02, percent-clipped=0.0 2023-06-24 10:57:18,658 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=7.80 vs. limit=15.0 2023-06-24 10:57:46,640 INFO [train.py:996] (3/4) Epoch 6, batch 28400, loss[loss=0.2245, simple_loss=0.2892, pruned_loss=0.07988, over 21688.00 frames. ], tot_loss[loss=0.2103, simple_loss=0.2846, pruned_loss=0.06797, over 4274672.44 frames. ], batch size: 112, lr: 4.91e-03, grad_scale: 32.0 2023-06-24 10:58:19,725 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.45 vs. limit=15.0 2023-06-24 10:58:59,533 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=1085418.0, ans=0.125 2023-06-24 10:59:01,603 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1085418.0, ans=0.1 2023-06-24 10:59:06,445 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=1085418.0, ans=0.0 2023-06-24 10:59:31,346 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=1085478.0, ans=0.025 2023-06-24 10:59:36,143 INFO [train.py:996] (3/4) Epoch 6, batch 28450, loss[loss=0.1689, simple_loss=0.2249, pruned_loss=0.05646, over 20733.00 frames. ], tot_loss[loss=0.2165, simple_loss=0.2894, pruned_loss=0.0718, over 4280718.49 frames. ], batch size: 607, lr: 4.91e-03, grad_scale: 32.0 2023-06-24 10:59:47,908 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.38 vs. limit=12.0 2023-06-24 10:59:54,529 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1085538.0, ans=0.0 2023-06-24 11:00:30,830 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1085658.0, ans=0.125 2023-06-24 11:00:42,790 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.100e+02 2.782e+02 3.154e+02 3.614e+02 5.624e+02, threshold=6.308e+02, percent-clipped=2.0 2023-06-24 11:01:25,032 INFO [train.py:996] (3/4) Epoch 6, batch 28500, loss[loss=0.2254, simple_loss=0.2946, pruned_loss=0.07812, over 21879.00 frames. ], tot_loss[loss=0.2209, simple_loss=0.2928, pruned_loss=0.07452, over 4286802.23 frames. 
], batch size: 124, lr: 4.91e-03, grad_scale: 16.0 2023-06-24 11:01:27,960 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=1085838.0, ans=0.125 2023-06-24 11:01:29,656 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1085838.0, ans=0.125 2023-06-24 11:01:50,741 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.41 vs. limit=15.0 2023-06-24 11:02:02,254 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1085898.0, ans=0.125 2023-06-24 11:02:02,787 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.38 vs. limit=15.0 2023-06-24 11:02:36,651 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=1086018.0, ans=0.125 2023-06-24 11:03:03,577 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=1086078.0, ans=0.2 2023-06-24 11:03:17,202 INFO [train.py:996] (3/4) Epoch 6, batch 28550, loss[loss=0.2323, simple_loss=0.3188, pruned_loss=0.07291, over 21282.00 frames. ], tot_loss[loss=0.2274, simple_loss=0.3009, pruned_loss=0.077, over 4286981.65 frames. ], batch size: 143, lr: 4.91e-03, grad_scale: 16.0 2023-06-24 11:04:37,435 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.267e+02 2.822e+02 3.220e+02 3.775e+02 6.822e+02, threshold=6.440e+02, percent-clipped=1.0 2023-06-24 11:04:39,428 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=1086318.0, ans=0.125 2023-06-24 11:05:12,854 INFO [train.py:996] (3/4) Epoch 6, batch 28600, loss[loss=0.2243, simple_loss=0.2997, pruned_loss=0.07442, over 21734.00 frames. ], tot_loss[loss=0.2333, simple_loss=0.3079, pruned_loss=0.07937, over 4282975.76 frames. ], batch size: 247, lr: 4.91e-03, grad_scale: 16.0 2023-06-24 11:05:28,195 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=2.87 vs. limit=12.0 2023-06-24 11:05:32,759 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1086438.0, ans=0.125 2023-06-24 11:06:05,567 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1086558.0, ans=0.0 2023-06-24 11:06:07,495 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1086558.0, ans=0.125 2023-06-24 11:06:28,705 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.42 vs. limit=15.0 2023-06-24 11:06:52,188 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=1086678.0, ans=0.2 2023-06-24 11:07:06,284 INFO [train.py:996] (3/4) Epoch 6, batch 28650, loss[loss=0.1919, simple_loss=0.2573, pruned_loss=0.06328, over 21538.00 frames. ], tot_loss[loss=0.2298, simple_loss=0.3023, pruned_loss=0.07862, over 4277399.36 frames. 
], batch size: 196, lr: 4.91e-03, grad_scale: 16.0 2023-06-24 11:07:07,440 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=11.89 vs. limit=15.0 2023-06-24 11:07:17,512 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1086738.0, ans=0.0 2023-06-24 11:07:30,312 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1086798.0, ans=0.125 2023-06-24 11:07:31,917 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1086798.0, ans=0.125 2023-06-24 11:07:45,734 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-24 11:08:14,717 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.097e+02 2.693e+02 2.997e+02 3.394e+02 5.567e+02, threshold=5.993e+02, percent-clipped=0.0 2023-06-24 11:08:34,724 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1086978.0, ans=0.125 2023-06-24 11:08:54,867 INFO [train.py:996] (3/4) Epoch 6, batch 28700, loss[loss=0.2434, simple_loss=0.3144, pruned_loss=0.08617, over 21650.00 frames. ], tot_loss[loss=0.23, simple_loss=0.3009, pruned_loss=0.07951, over 4280488.51 frames. ], batch size: 389, lr: 4.91e-03, grad_scale: 16.0 2023-06-24 11:09:27,087 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1087098.0, ans=0.1 2023-06-24 11:09:55,434 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=1087158.0, ans=0.07 2023-06-24 11:10:04,592 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1087218.0, ans=0.125 2023-06-24 11:10:09,590 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1087218.0, ans=0.0 2023-06-24 11:10:17,199 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=6.06 vs. limit=15.0 2023-06-24 11:10:44,957 INFO [train.py:996] (3/4) Epoch 6, batch 28750, loss[loss=0.2178, simple_loss=0.2859, pruned_loss=0.07486, over 21447.00 frames. ], tot_loss[loss=0.2319, simple_loss=0.3022, pruned_loss=0.08078, over 4284143.09 frames. ], batch size: 144, lr: 4.91e-03, grad_scale: 16.0 2023-06-24 11:11:06,443 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=1087398.0, ans=0.0 2023-06-24 11:11:48,708 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=1087518.0, ans=0.0 2023-06-24 11:11:52,172 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1087518.0, ans=0.125 2023-06-24 11:11:53,480 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.016e+02 2.664e+02 3.026e+02 3.382e+02 4.910e+02, threshold=6.051e+02, percent-clipped=0.0 2023-06-24 11:12:33,143 INFO [train.py:996] (3/4) Epoch 6, batch 28800, loss[loss=0.244, simple_loss=0.3176, pruned_loss=0.08526, over 21840.00 frames. 
], tot_loss[loss=0.2317, simple_loss=0.3035, pruned_loss=0.07992, over 4275594.28 frames. ], batch size: 282, lr: 4.90e-03, grad_scale: 32.0 2023-06-24 11:13:17,298 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1087758.0, ans=0.1 2023-06-24 11:13:55,411 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.96 vs. limit=15.0 2023-06-24 11:14:02,096 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.38 vs. limit=15.0 2023-06-24 11:14:11,744 INFO [train.py:996] (3/4) Epoch 6, batch 28850, loss[loss=0.2298, simple_loss=0.3004, pruned_loss=0.07958, over 21688.00 frames. ], tot_loss[loss=0.2335, simple_loss=0.3051, pruned_loss=0.08095, over 4280650.07 frames. ], batch size: 263, lr: 4.90e-03, grad_scale: 32.0 2023-06-24 11:14:42,428 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1087998.0, ans=0.1 2023-06-24 11:15:12,599 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=1088058.0, ans=0.125 2023-06-24 11:15:23,440 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1088118.0, ans=0.125 2023-06-24 11:15:25,930 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.267e+02 2.795e+02 3.097e+02 3.558e+02 6.026e+02, threshold=6.195e+02, percent-clipped=0.0 2023-06-24 11:16:01,655 INFO [train.py:996] (3/4) Epoch 6, batch 28900, loss[loss=0.2294, simple_loss=0.3005, pruned_loss=0.07913, over 21893.00 frames. ], tot_loss[loss=0.2357, simple_loss=0.3075, pruned_loss=0.082, over 4277787.74 frames. ], batch size: 316, lr: 4.90e-03, grad_scale: 32.0 2023-06-24 11:16:34,534 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=1088298.0, ans=0.04949747468305833 2023-06-24 11:16:38,636 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.49 vs. limit=6.0 2023-06-24 11:16:38,694 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.10 vs. limit=12.0 2023-06-24 11:17:36,037 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.37 vs. limit=15.0 2023-06-24 11:17:39,341 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1088478.0, ans=0.1 2023-06-24 11:18:05,918 INFO [train.py:996] (3/4) Epoch 6, batch 28950, loss[loss=0.2205, simple_loss=0.317, pruned_loss=0.06198, over 21820.00 frames. ], tot_loss[loss=0.2366, simple_loss=0.3098, pruned_loss=0.08168, over 4274182.95 frames. 
], batch size: 316, lr: 4.90e-03, grad_scale: 32.0 2023-06-24 11:18:57,196 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1088658.0, ans=0.1 2023-06-24 11:19:17,133 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1088718.0, ans=0.125 2023-06-24 11:19:17,995 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.152e+02 3.066e+02 3.487e+02 4.356e+02 7.485e+02, threshold=6.974e+02, percent-clipped=4.0 2023-06-24 11:19:46,809 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=1088778.0, ans=0.125 2023-06-24 11:19:46,946 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1088778.0, ans=0.125 2023-06-24 11:19:53,906 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1088778.0, ans=0.1 2023-06-24 11:19:57,025 INFO [train.py:996] (3/4) Epoch 6, batch 29000, loss[loss=0.2581, simple_loss=0.3331, pruned_loss=0.09155, over 21580.00 frames. ], tot_loss[loss=0.2379, simple_loss=0.3124, pruned_loss=0.08174, over 4272055.68 frames. ], batch size: 414, lr: 4.90e-03, grad_scale: 16.0 2023-06-24 11:20:37,549 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=1088898.0, ans=0.125 2023-06-24 11:20:59,696 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=1089018.0, ans=0.015 2023-06-24 11:21:21,507 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=5.25 vs. limit=12.0 2023-06-24 11:21:39,609 INFO [train.py:996] (3/4) Epoch 6, batch 29050, loss[loss=0.2283, simple_loss=0.292, pruned_loss=0.08229, over 21528.00 frames. ], tot_loss[loss=0.238, simple_loss=0.3117, pruned_loss=0.08211, over 4281193.83 frames. ], batch size: 194, lr: 4.90e-03, grad_scale: 16.0 2023-06-24 11:22:06,230 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1089198.0, ans=0.125 2023-06-24 11:22:59,672 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.070e+02 2.602e+02 2.960e+02 3.468e+02 4.732e+02, threshold=5.920e+02, percent-clipped=0.0 2023-06-24 11:23:27,544 INFO [train.py:996] (3/4) Epoch 6, batch 29100, loss[loss=0.1815, simple_loss=0.249, pruned_loss=0.05699, over 21606.00 frames. ], tot_loss[loss=0.231, simple_loss=0.3032, pruned_loss=0.07939, over 4283005.15 frames. ], batch size: 231, lr: 4.90e-03, grad_scale: 16.0 2023-06-24 11:23:53,334 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.42 vs. limit=15.0 2023-06-24 11:24:14,960 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=1089558.0, ans=0.2 2023-06-24 11:24:32,743 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.25 vs. limit=15.0 2023-06-24 11:25:10,506 INFO [train.py:996] (3/4) Epoch 6, batch 29150, loss[loss=0.2296, simple_loss=0.3004, pruned_loss=0.07944, over 21772.00 frames. 
], tot_loss[loss=0.2291, simple_loss=0.3023, pruned_loss=0.07795, over 4285413.09 frames. ], batch size: 371, lr: 4.90e-03, grad_scale: 16.0 2023-06-24 11:26:10,774 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=1089858.0, ans=0.0 2023-06-24 11:26:21,424 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-24 11:26:27,907 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1089918.0, ans=0.125 2023-06-24 11:26:30,813 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.964e+02 2.501e+02 2.832e+02 3.252e+02 5.475e+02, threshold=5.663e+02, percent-clipped=0.0 2023-06-24 11:26:58,347 INFO [train.py:996] (3/4) Epoch 6, batch 29200, loss[loss=0.2657, simple_loss=0.3226, pruned_loss=0.1044, over 21438.00 frames. ], tot_loss[loss=0.2267, simple_loss=0.2987, pruned_loss=0.0774, over 4276511.58 frames. ], batch size: 509, lr: 4.90e-03, grad_scale: 32.0 2023-06-24 11:27:39,551 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=1090098.0, ans=0.0 2023-06-24 11:28:18,514 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.62 vs. limit=15.0 2023-06-24 11:28:20,972 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1090218.0, ans=0.125 2023-06-24 11:28:26,060 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1090218.0, ans=0.0 2023-06-24 11:28:33,222 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1090278.0, ans=0.125 2023-06-24 11:28:44,382 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1090278.0, ans=0.1 2023-06-24 11:28:47,227 INFO [train.py:996] (3/4) Epoch 6, batch 29250, loss[loss=0.2182, simple_loss=0.3085, pruned_loss=0.0639, over 21763.00 frames. ], tot_loss[loss=0.2243, simple_loss=0.2973, pruned_loss=0.07563, over 4275250.18 frames. ], batch size: 282, lr: 4.90e-03, grad_scale: 32.0 2023-06-24 11:29:20,439 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.63 vs. limit=15.0 2023-06-24 11:30:08,090 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.954e+02 2.479e+02 2.949e+02 4.059e+02 6.998e+02, threshold=5.898e+02, percent-clipped=9.0 2023-06-24 11:30:08,568 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=1090518.0, ans=0.025 2023-06-24 11:30:40,979 INFO [train.py:996] (3/4) Epoch 6, batch 29300, loss[loss=0.1958, simple_loss=0.2669, pruned_loss=0.06235, over 21364.00 frames. ], tot_loss[loss=0.2244, simple_loss=0.2984, pruned_loss=0.07516, over 4263944.36 frames. ], batch size: 194, lr: 4.90e-03, grad_scale: 32.0 2023-06-24 11:32:31,104 INFO [train.py:996] (3/4) Epoch 6, batch 29350, loss[loss=0.2088, simple_loss=0.2937, pruned_loss=0.06199, over 21248.00 frames. ], tot_loss[loss=0.2203, simple_loss=0.2933, pruned_loss=0.0737, over 4257897.93 frames. 
], batch size: 176, lr: 4.90e-03, grad_scale: 32.0 2023-06-24 11:33:06,299 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=1090998.0, ans=0.125 2023-06-24 11:33:19,317 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=1091058.0, ans=0.2 2023-06-24 11:33:24,663 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=1091058.0, ans=0.125 2023-06-24 11:33:43,340 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.001e+02 2.584e+02 3.038e+02 3.610e+02 5.891e+02, threshold=6.076e+02, percent-clipped=0.0 2023-06-24 11:34:02,060 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=9.02 vs. limit=15.0 2023-06-24 11:34:22,816 INFO [train.py:996] (3/4) Epoch 6, batch 29400, loss[loss=0.2147, simple_loss=0.3131, pruned_loss=0.05817, over 21719.00 frames. ], tot_loss[loss=0.2187, simple_loss=0.2934, pruned_loss=0.072, over 4251227.40 frames. ], batch size: 298, lr: 4.90e-03, grad_scale: 32.0 2023-06-24 11:35:08,061 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1091358.0, ans=0.1 2023-06-24 11:35:39,637 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=1091418.0, ans=0.2 2023-06-24 11:35:42,845 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1091418.0, ans=0.125 2023-06-24 11:36:00,834 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1091478.0, ans=0.1 2023-06-24 11:36:11,551 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=1091538.0, ans=0.0 2023-06-24 11:36:12,514 INFO [train.py:996] (3/4) Epoch 6, batch 29450, loss[loss=0.2474, simple_loss=0.3212, pruned_loss=0.0868, over 21722.00 frames. ], tot_loss[loss=0.2178, simple_loss=0.2932, pruned_loss=0.07125, over 4259180.34 frames. ], batch size: 332, lr: 4.90e-03, grad_scale: 32.0 2023-06-24 11:37:10,299 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=1091658.0, ans=0.025 2023-06-24 11:37:26,462 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.944e+02 2.589e+02 2.908e+02 3.358e+02 5.330e+02, threshold=5.817e+02, percent-clipped=0.0 2023-06-24 11:37:32,976 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=1091718.0, ans=0.125 2023-06-24 11:37:36,529 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1091778.0, ans=0.125 2023-06-24 11:37:57,894 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.52 vs. limit=15.0 2023-06-24 11:38:00,157 INFO [train.py:996] (3/4) Epoch 6, batch 29500, loss[loss=0.296, simple_loss=0.3351, pruned_loss=0.1285, over 21743.00 frames. ], tot_loss[loss=0.2222, simple_loss=0.2964, pruned_loss=0.07398, over 4262436.01 frames. 
], batch size: 508, lr: 4.90e-03, grad_scale: 32.0 2023-06-24 11:39:15,275 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=1092018.0, ans=0.0 2023-06-24 11:39:18,051 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=13.33 vs. limit=22.5 2023-06-24 11:39:43,313 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1092078.0, ans=0.125 2023-06-24 11:39:45,519 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.94 vs. limit=15.0 2023-06-24 11:39:46,638 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1092078.0, ans=0.125 2023-06-24 11:39:49,832 INFO [train.py:996] (3/4) Epoch 6, batch 29550, loss[loss=0.2232, simple_loss=0.292, pruned_loss=0.07717, over 21839.00 frames. ], tot_loss[loss=0.2242, simple_loss=0.2963, pruned_loss=0.07603, over 4274664.82 frames. ], batch size: 332, lr: 4.89e-03, grad_scale: 32.0 2023-06-24 11:39:50,390 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1092138.0, ans=0.125 2023-06-24 11:40:11,306 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=1092198.0, ans=0.125 2023-06-24 11:40:13,889 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=5.74 vs. limit=15.0 2023-06-24 11:41:00,964 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1092318.0, ans=0.0 2023-06-24 11:41:05,559 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.190e+02 2.863e+02 3.307e+02 3.931e+02 5.796e+02, threshold=6.614e+02, percent-clipped=0.0 2023-06-24 11:41:18,675 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1092318.0, ans=0.1 2023-06-24 11:41:35,173 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1092378.0, ans=0.0 2023-06-24 11:41:39,640 INFO [train.py:996] (3/4) Epoch 6, batch 29600, loss[loss=0.2428, simple_loss=0.3162, pruned_loss=0.08468, over 21210.00 frames. ], tot_loss[loss=0.2286, simple_loss=0.3015, pruned_loss=0.07785, over 4278162.50 frames. ], batch size: 143, lr: 4.89e-03, grad_scale: 32.0 2023-06-24 11:42:07,819 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=1092498.0, ans=0.125 2023-06-24 11:42:14,869 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1092498.0, ans=0.1 2023-06-24 11:42:21,771 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1092498.0, ans=0.125 2023-06-24 11:43:19,202 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.86 vs. 
limit=15.0 2023-06-24 11:43:19,302 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.69 vs. limit=15.0 2023-06-24 11:43:26,413 INFO [train.py:996] (3/4) Epoch 6, batch 29650, loss[loss=0.294, simple_loss=0.4091, pruned_loss=0.08947, over 19829.00 frames. ], tot_loss[loss=0.224, simple_loss=0.2988, pruned_loss=0.07459, over 4274689.70 frames. ], batch size: 702, lr: 4.89e-03, grad_scale: 32.0 2023-06-24 11:43:39,428 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=1092738.0, ans=0.0 2023-06-24 11:43:53,762 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.68 vs. limit=15.0 2023-06-24 11:44:09,151 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.78 vs. limit=15.0 2023-06-24 11:44:30,118 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=1092858.0, ans=0.0 2023-06-24 11:44:47,010 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.627e+02 2.546e+02 3.028e+02 3.755e+02 5.764e+02, threshold=6.055e+02, percent-clipped=0.0 2023-06-24 11:45:07,003 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=1092978.0, ans=0.5 2023-06-24 11:45:14,616 INFO [train.py:996] (3/4) Epoch 6, batch 29700, loss[loss=0.3273, simple_loss=0.4241, pruned_loss=0.1152, over 21528.00 frames. ], tot_loss[loss=0.227, simple_loss=0.3033, pruned_loss=0.07535, over 4272124.34 frames. ], batch size: 471, lr: 4.89e-03, grad_scale: 16.0 2023-06-24 11:45:37,102 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1093038.0, ans=0.0 2023-06-24 11:45:44,445 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1093098.0, ans=0.125 2023-06-24 11:46:07,428 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1093158.0, ans=0.1 2023-06-24 11:47:02,446 INFO [train.py:996] (3/4) Epoch 6, batch 29750, loss[loss=0.2725, simple_loss=0.363, pruned_loss=0.09101, over 21681.00 frames. ], tot_loss[loss=0.2288, simple_loss=0.3074, pruned_loss=0.07508, over 4269951.70 frames. ], batch size: 441, lr: 4.89e-03, grad_scale: 16.0 2023-06-24 11:47:12,515 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=9.08 vs. limit=22.5 2023-06-24 11:47:23,856 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.30 vs. 
limit=22.5 2023-06-24 11:47:28,520 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1093398.0, ans=0.0 2023-06-24 11:47:36,731 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1093398.0, ans=0.1 2023-06-24 11:47:36,743 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1093398.0, ans=0.125 2023-06-24 11:47:41,849 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1093398.0, ans=0.125 2023-06-24 11:48:23,012 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.842e+02 2.425e+02 2.693e+02 3.074e+02 5.352e+02, threshold=5.385e+02, percent-clipped=0.0 2023-06-24 11:48:23,839 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=1093518.0, ans=0.125 2023-06-24 11:48:54,211 INFO [train.py:996] (3/4) Epoch 6, batch 29800, loss[loss=0.2317, simple_loss=0.3045, pruned_loss=0.0795, over 21194.00 frames. ], tot_loss[loss=0.2304, simple_loss=0.3086, pruned_loss=0.07611, over 4275308.94 frames. ], batch size: 143, lr: 4.89e-03, grad_scale: 16.0 2023-06-24 11:49:13,936 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1093638.0, ans=0.0 2023-06-24 11:49:20,527 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1093698.0, ans=0.125 2023-06-24 11:49:27,702 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1093698.0, ans=0.0 2023-06-24 11:50:34,852 INFO [train.py:996] (3/4) Epoch 6, batch 29850, loss[loss=0.2203, simple_loss=0.2921, pruned_loss=0.07424, over 21764.00 frames. ], tot_loss[loss=0.226, simple_loss=0.3039, pruned_loss=0.07402, over 4275551.33 frames. ], batch size: 112, lr: 4.89e-03, grad_scale: 16.0 2023-06-24 11:51:05,781 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=1093998.0, ans=0.0 2023-06-24 11:51:07,910 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1093998.0, ans=0.1 2023-06-24 11:51:33,177 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1094058.0, ans=0.125 2023-06-24 11:51:55,460 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.130e+02 2.491e+02 2.734e+02 3.399e+02 8.130e+02, threshold=5.469e+02, percent-clipped=4.0 2023-06-24 11:52:03,318 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1094178.0, ans=0.125 2023-06-24 11:52:05,582 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=5.95 vs. limit=15.0 2023-06-24 11:52:26,402 INFO [train.py:996] (3/4) Epoch 6, batch 29900, loss[loss=0.2958, simple_loss=0.3461, pruned_loss=0.1228, over 21498.00 frames. ], tot_loss[loss=0.2268, simple_loss=0.3027, pruned_loss=0.07539, over 4282914.74 frames. 
], batch size: 471, lr: 4.89e-03, grad_scale: 16.0 2023-06-24 11:52:36,299 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1094238.0, ans=0.1 2023-06-24 11:52:41,834 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1094238.0, ans=0.125 2023-06-24 11:52:57,261 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=1094298.0, ans=0.125 2023-06-24 11:53:04,164 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1094298.0, ans=0.1 2023-06-24 11:53:05,829 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=1094298.0, ans=0.5 2023-06-24 11:53:43,643 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=18.10 vs. limit=22.5 2023-06-24 11:53:50,240 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1094418.0, ans=0.125 2023-06-24 11:54:16,345 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=1094478.0, ans=0.0 2023-06-24 11:54:23,079 INFO [train.py:996] (3/4) Epoch 6, batch 29950, loss[loss=0.2732, simple_loss=0.341, pruned_loss=0.1027, over 21795.00 frames. ], tot_loss[loss=0.2331, simple_loss=0.3077, pruned_loss=0.07927, over 4288576.28 frames. ], batch size: 441, lr: 4.89e-03, grad_scale: 16.0 2023-06-24 11:54:37,674 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1094538.0, ans=0.125 2023-06-24 11:55:04,883 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1094598.0, ans=0.125 2023-06-24 11:55:21,044 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1094658.0, ans=0.0 2023-06-24 11:55:41,530 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.234e+02 2.840e+02 3.123e+02 3.616e+02 5.024e+02, threshold=6.246e+02, percent-clipped=0.0 2023-06-24 11:55:45,900 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1094718.0, ans=0.125 2023-06-24 11:56:09,344 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=1094778.0, ans=0.04949747468305833 2023-06-24 11:56:13,769 INFO [train.py:996] (3/4) Epoch 6, batch 30000, loss[loss=0.2147, simple_loss=0.3049, pruned_loss=0.06228, over 21647.00 frames. ], tot_loss[loss=0.2337, simple_loss=0.3088, pruned_loss=0.07932, over 4288144.86 frames. ], batch size: 230, lr: 4.89e-03, grad_scale: 32.0 2023-06-24 11:56:13,770 INFO [train.py:1019] (3/4) Computing validation loss 2023-06-24 11:56:34,161 INFO [train.py:1028] (3/4) Epoch 6, validation: loss=0.2459, simple_loss=0.3437, pruned_loss=0.07409, over 1796401.00 frames. 
2023-06-24 11:56:34,162 INFO [train.py:1029] (3/4) Maximum memory allocated so far is 23366MB 2023-06-24 11:57:05,430 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1094898.0, ans=0.125 2023-06-24 11:57:44,236 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1094958.0, ans=0.0 2023-06-24 11:57:50,687 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.53 vs. limit=22.5 2023-06-24 11:57:55,649 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-24 11:58:36,145 INFO [train.py:996] (3/4) Epoch 6, batch 30050, loss[loss=0.3104, simple_loss=0.4084, pruned_loss=0.1062, over 21525.00 frames. ], tot_loss[loss=0.2317, simple_loss=0.3116, pruned_loss=0.07589, over 4276344.07 frames. ], batch size: 471, lr: 4.89e-03, grad_scale: 32.0 2023-06-24 11:58:51,367 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=1095138.0, ans=0.0 2023-06-24 11:59:15,129 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.22 vs. limit=15.0 2023-06-24 11:59:55,139 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.808e+02 2.460e+02 2.888e+02 3.811e+02 6.345e+02, threshold=5.776e+02, percent-clipped=1.0 2023-06-24 12:00:24,952 INFO [train.py:996] (3/4) Epoch 6, batch 30100, loss[loss=0.2076, simple_loss=0.2705, pruned_loss=0.07239, over 21631.00 frames. ], tot_loss[loss=0.2317, simple_loss=0.312, pruned_loss=0.07563, over 4266999.72 frames. ], batch size: 333, lr: 4.89e-03, grad_scale: 16.0 2023-06-24 12:00:51,895 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-24 12:01:06,911 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=1095498.0, ans=0.04949747468305833 2023-06-24 12:01:12,192 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1095558.0, ans=0.1 2023-06-24 12:02:02,197 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.13 vs. limit=12.0 2023-06-24 12:02:15,859 INFO [train.py:996] (3/4) Epoch 6, batch 30150, loss[loss=0.3255, simple_loss=0.3658, pruned_loss=0.1426, over 21319.00 frames. ], tot_loss[loss=0.2324, simple_loss=0.3094, pruned_loss=0.07771, over 4274377.46 frames. ], batch size: 507, lr: 4.89e-03, grad_scale: 16.0 2023-06-24 12:02:25,511 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1095738.0, ans=0.125 2023-06-24 12:02:27,581 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1095738.0, ans=0.125 2023-06-24 12:02:42,239 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=1095738.0, ans=0.0 2023-06-24 12:02:48,966 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=5.49 vs. 
limit=15.0 2023-06-24 12:03:13,658 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1095858.0, ans=0.125 2023-06-24 12:03:44,045 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.176e+02 2.662e+02 2.970e+02 3.572e+02 6.402e+02, threshold=5.941e+02, percent-clipped=1.0 2023-06-24 12:03:55,595 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=1095978.0, ans=0.07 2023-06-24 12:04:19,437 INFO [train.py:996] (3/4) Epoch 6, batch 30200, loss[loss=0.2663, simple_loss=0.3458, pruned_loss=0.09344, over 21748.00 frames. ], tot_loss[loss=0.2321, simple_loss=0.311, pruned_loss=0.07657, over 4272836.01 frames. ], batch size: 441, lr: 4.89e-03, grad_scale: 16.0 2023-06-24 12:04:29,422 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.77 vs. limit=12.0 2023-06-24 12:05:43,608 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.99 vs. limit=12.0 2023-06-24 12:06:04,586 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.19 vs. limit=15.0 2023-06-24 12:06:10,624 INFO [train.py:996] (3/4) Epoch 6, batch 30250, loss[loss=0.2769, simple_loss=0.3711, pruned_loss=0.09131, over 21313.00 frames. ], tot_loss[loss=0.237, simple_loss=0.3177, pruned_loss=0.0782, over 4273666.98 frames. ], batch size: 549, lr: 4.89e-03, grad_scale: 16.0 2023-06-24 12:06:23,450 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=1096338.0, ans=0.2 2023-06-24 12:06:25,275 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1096338.0, ans=0.0 2023-06-24 12:06:30,404 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1096338.0, ans=0.1 2023-06-24 12:06:41,498 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.39 vs. limit=15.0 2023-06-24 12:07:27,928 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.281e+02 2.716e+02 3.093e+02 3.619e+02 5.439e+02, threshold=6.186e+02, percent-clipped=0.0 2023-06-24 12:07:31,270 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=9.13 vs. limit=15.0 2023-06-24 12:07:57,914 INFO [train.py:996] (3/4) Epoch 6, batch 30300, loss[loss=0.2046, simple_loss=0.2713, pruned_loss=0.06892, over 21924.00 frames. ], tot_loss[loss=0.2347, simple_loss=0.3135, pruned_loss=0.07795, over 4279761.87 frames. 
], batch size: 119, lr: 4.88e-03, grad_scale: 16.0 2023-06-24 12:07:58,590 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1096638.0, ans=0.1 2023-06-24 12:08:00,463 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=1096638.0, ans=0.125 2023-06-24 12:08:19,625 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=1096698.0, ans=0.2 2023-06-24 12:08:54,043 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=1096758.0, ans=0.025 2023-06-24 12:09:44,889 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.22 vs. limit=12.0 2023-06-24 12:09:53,994 INFO [train.py:996] (3/4) Epoch 6, batch 30350, loss[loss=0.2896, simple_loss=0.3662, pruned_loss=0.1065, over 21547.00 frames. ], tot_loss[loss=0.2369, simple_loss=0.3146, pruned_loss=0.07956, over 4282389.49 frames. ], batch size: 441, lr: 4.88e-03, grad_scale: 16.0 2023-06-24 12:10:22,563 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten.whitening_limit, batch_count=1096998.0, ans=22.5 2023-06-24 12:10:31,194 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1097058.0, ans=0.1 2023-06-24 12:10:56,358 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.042e+02 2.694e+02 3.043e+02 3.524e+02 5.331e+02, threshold=6.085e+02, percent-clipped=0.0 2023-06-24 12:11:12,601 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1097178.0, ans=0.125 2023-06-24 12:11:13,748 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1097178.0, ans=0.1 2023-06-24 12:11:27,900 INFO [train.py:996] (3/4) Epoch 6, batch 30400, loss[loss=0.2227, simple_loss=0.272, pruned_loss=0.08669, over 20118.00 frames. ], tot_loss[loss=0.2325, simple_loss=0.3087, pruned_loss=0.07821, over 4271425.90 frames. ], batch size: 702, lr: 4.88e-03, grad_scale: 32.0 2023-06-24 12:11:30,051 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1097238.0, ans=0.1 2023-06-24 12:11:32,732 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=1097238.0, ans=0.2 2023-06-24 12:11:36,437 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1097238.0, ans=0.125 2023-06-24 12:11:37,708 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1097238.0, ans=0.1 2023-06-24 12:12:03,412 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1097358.0, ans=0.1 2023-06-24 12:12:57,209 INFO [train.py:996] (3/4) Epoch 6, batch 30450, loss[loss=0.3034, simple_loss=0.4147, pruned_loss=0.09602, over 19755.00 frames. ], tot_loss[loss=0.2338, simple_loss=0.3102, pruned_loss=0.07871, over 4209705.33 frames. 
], batch size: 702, lr: 4.88e-03, grad_scale: 32.0 2023-06-24 12:13:16,202 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1097598.0, ans=0.0 2023-06-24 12:13:30,158 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=1097658.0, ans=0.05 2023-06-24 12:13:56,608 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.978e+02 4.419e+02 5.663e+02 8.899e+02 2.204e+03, threshold=1.133e+03, percent-clipped=46.0 2023-06-24 12:16:21,097 INFO [train.py:996] (3/4) Epoch 7, batch 0, loss[loss=0.1929, simple_loss=0.2609, pruned_loss=0.06249, over 21277.00 frames. ], tot_loss[loss=0.1929, simple_loss=0.2609, pruned_loss=0.06249, over 21277.00 frames. ], batch size: 551, lr: 4.48e-03, grad_scale: 32.0 2023-06-24 12:16:21,098 INFO [train.py:1019] (3/4) Computing validation loss 2023-06-24 12:16:38,597 INFO [train.py:1028] (3/4) Epoch 7, validation: loss=0.2421, simple_loss=0.346, pruned_loss=0.0691, over 1796401.00 frames. 2023-06-24 12:16:38,598 INFO [train.py:1029] (3/4) Maximum memory allocated so far is 23366MB 2023-06-24 12:17:42,267 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=11.85 vs. limit=15.0 2023-06-24 12:18:25,401 INFO [train.py:996] (3/4) Epoch 7, batch 50, loss[loss=0.2644, simple_loss=0.3351, pruned_loss=0.09688, over 21388.00 frames. ], tot_loss[loss=0.2287, simple_loss=0.3067, pruned_loss=0.07539, over 965558.97 frames. ], batch size: 471, lr: 4.48e-03, grad_scale: 32.0 2023-06-24 12:18:58,110 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.68 vs. limit=22.5 2023-06-24 12:19:20,447 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=1098222.0, ans=0.2 2023-06-24 12:20:01,487 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.157e+02 2.689e+02 3.085e+02 3.734e+02 9.044e+02, threshold=6.169e+02, percent-clipped=0.0 2023-06-24 12:20:06,065 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=13.96 vs. limit=22.5 2023-06-24 12:20:13,726 INFO [train.py:996] (3/4) Epoch 7, batch 100, loss[loss=0.2539, simple_loss=0.3372, pruned_loss=0.08532, over 19926.00 frames. ], tot_loss[loss=0.2436, simple_loss=0.3268, pruned_loss=0.08016, over 1699499.32 frames. ], batch size: 702, lr: 4.48e-03, grad_scale: 16.0 2023-06-24 12:20:15,810 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-24 12:20:39,843 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=1098462.0, ans=0.0 2023-06-24 12:20:55,600 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1098462.0, ans=0.1 2023-06-24 12:22:00,444 INFO [train.py:996] (3/4) Epoch 7, batch 150, loss[loss=0.2422, simple_loss=0.3334, pruned_loss=0.07544, over 21744.00 frames. ], tot_loss[loss=0.246, simple_loss=0.3301, pruned_loss=0.08095, over 2263912.74 frames. 
], batch size: 298, lr: 4.48e-03, grad_scale: 16.0 2023-06-24 12:22:59,375 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.67 vs. limit=15.0 2023-06-24 12:23:36,487 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.136e+02 2.604e+02 2.896e+02 3.363e+02 6.379e+02, threshold=5.792e+02, percent-clipped=1.0 2023-06-24 12:23:47,923 INFO [train.py:996] (3/4) Epoch 7, batch 200, loss[loss=0.2519, simple_loss=0.3452, pruned_loss=0.07936, over 21674.00 frames. ], tot_loss[loss=0.2433, simple_loss=0.3263, pruned_loss=0.08018, over 2700777.93 frames. ], batch size: 414, lr: 4.48e-03, grad_scale: 16.0 2023-06-24 12:24:54,127 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1099122.0, ans=0.0 2023-06-24 12:25:30,322 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-24 12:25:36,648 INFO [train.py:996] (3/4) Epoch 7, batch 250, loss[loss=0.2682, simple_loss=0.3163, pruned_loss=0.1101, over 21683.00 frames. ], tot_loss[loss=0.2415, simple_loss=0.3221, pruned_loss=0.08043, over 3045991.44 frames. ], batch size: 507, lr: 4.48e-03, grad_scale: 16.0 2023-06-24 12:26:16,688 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1099362.0, ans=0.0 2023-06-24 12:26:36,185 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=1099422.0, ans=0.025 2023-06-24 12:26:44,660 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=1099482.0, ans=0.125 2023-06-24 12:27:08,319 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=1099542.0, ans=0.0 2023-06-24 12:27:14,907 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.025e+02 2.504e+02 2.848e+02 3.185e+02 4.478e+02, threshold=5.696e+02, percent-clipped=0.0 2023-06-24 12:27:27,341 INFO [train.py:996] (3/4) Epoch 7, batch 300, loss[loss=0.1958, simple_loss=0.2614, pruned_loss=0.06515, over 21618.00 frames. ], tot_loss[loss=0.2355, simple_loss=0.315, pruned_loss=0.07797, over 3315920.38 frames. ], batch size: 247, lr: 4.48e-03, grad_scale: 16.0 2023-06-24 12:28:15,380 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1099662.0, ans=0.1 2023-06-24 12:28:34,540 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1099722.0, ans=0.1 2023-06-24 12:29:18,767 INFO [train.py:996] (3/4) Epoch 7, batch 350, loss[loss=0.2595, simple_loss=0.3396, pruned_loss=0.08969, over 21724.00 frames. ], tot_loss[loss=0.2312, simple_loss=0.3086, pruned_loss=0.07688, over 3528978.94 frames. 
], batch size: 441, lr: 4.48e-03, grad_scale: 16.0 2023-06-24 12:29:23,550 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1099902.0, ans=0.125 2023-06-24 12:30:34,689 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1100082.0, ans=0.125 2023-06-24 12:30:58,775 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.904e+02 2.718e+02 3.112e+02 3.692e+02 6.265e+02, threshold=6.224e+02, percent-clipped=2.0 2023-06-24 12:30:59,455 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1100142.0, ans=0.125 2023-06-24 12:31:10,191 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=1100202.0, ans=0.1 2023-06-24 12:31:11,303 INFO [train.py:996] (3/4) Epoch 7, batch 400, loss[loss=0.2257, simple_loss=0.3307, pruned_loss=0.06035, over 19848.00 frames. ], tot_loss[loss=0.227, simple_loss=0.304, pruned_loss=0.07499, over 3694697.33 frames. ], batch size: 703, lr: 4.48e-03, grad_scale: 32.0 2023-06-24 12:31:29,591 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1100202.0, ans=0.1 2023-06-24 12:32:00,730 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=7.17 vs. limit=15.0 2023-06-24 12:32:03,620 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1100322.0, ans=0.125 2023-06-24 12:32:56,281 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.79 vs. limit=10.0 2023-06-24 12:33:02,100 INFO [train.py:996] (3/4) Epoch 7, batch 450, loss[loss=0.2491, simple_loss=0.3621, pruned_loss=0.068, over 21738.00 frames. ], tot_loss[loss=0.2256, simple_loss=0.3027, pruned_loss=0.07426, over 3827241.21 frames. ], batch size: 414, lr: 4.48e-03, grad_scale: 16.0 2023-06-24 12:33:43,454 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=1100562.0, ans=0.07 2023-06-24 12:33:51,762 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=1100622.0, ans=0.2 2023-06-24 12:34:34,927 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.89 vs. limit=15.0 2023-06-24 12:34:40,565 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.890e+02 2.615e+02 3.361e+02 4.061e+02 5.988e+02, threshold=6.722e+02, percent-clipped=0.0 2023-06-24 12:34:41,914 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.83 vs. limit=15.0 2023-06-24 12:34:57,191 INFO [train.py:996] (3/4) Epoch 7, batch 500, loss[loss=0.2068, simple_loss=0.2729, pruned_loss=0.07034, over 21730.00 frames. ], tot_loss[loss=0.2233, simple_loss=0.3005, pruned_loss=0.07303, over 3933220.92 frames. ], batch size: 112, lr: 4.48e-03, grad_scale: 16.0 2023-06-24 12:35:41,431 INFO [scaling.py:962] (3/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=5.63 vs. 
limit=8.0 2023-06-24 12:36:30,649 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=1101042.0, ans=0.2 2023-06-24 12:36:46,106 INFO [train.py:996] (3/4) Epoch 7, batch 550, loss[loss=0.2289, simple_loss=0.2979, pruned_loss=0.07997, over 21905.00 frames. ], tot_loss[loss=0.2214, simple_loss=0.2997, pruned_loss=0.07158, over 4007513.59 frames. ], batch size: 107, lr: 4.48e-03, grad_scale: 16.0 2023-06-24 12:36:51,832 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=1101102.0, ans=0.125 2023-06-24 12:37:14,783 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=1101162.0, ans=0.09899494936611666 2023-06-24 12:37:19,853 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1101162.0, ans=0.1 2023-06-24 12:37:30,565 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1101222.0, ans=0.125 2023-06-24 12:37:38,971 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=1101222.0, ans=0.025 2023-06-24 12:37:59,460 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=1101282.0, ans=0.025 2023-06-24 12:38:01,078 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1101282.0, ans=0.0 2023-06-24 12:38:13,396 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=1101342.0, ans=0.0 2023-06-24 12:38:14,376 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.022e+02 2.641e+02 3.136e+02 3.627e+02 5.437e+02, threshold=6.272e+02, percent-clipped=0.0 2023-06-24 12:38:16,782 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=1101342.0, ans=0.0 2023-06-24 12:38:28,514 INFO [train.py:996] (3/4) Epoch 7, batch 600, loss[loss=0.217, simple_loss=0.2884, pruned_loss=0.07282, over 21511.00 frames. ], tot_loss[loss=0.2248, simple_loss=0.3046, pruned_loss=0.07245, over 4071251.19 frames. ], batch size: 194, lr: 4.48e-03, grad_scale: 8.0 2023-06-24 12:38:50,425 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=1101462.0, ans=0.2 2023-06-24 12:39:24,978 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-24 12:39:27,038 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-24 12:39:38,835 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=1101582.0, ans=0.2 2023-06-24 12:39:44,356 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1101582.0, ans=0.125 2023-06-24 12:39:59,969 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=1101642.0, ans=0.0 2023-06-24 12:40:16,846 INFO [train.py:996] (3/4) Epoch 7, batch 650, loss[loss=0.2267, simple_loss=0.3144, pruned_loss=0.06949, over 21356.00 frames. 
], tot_loss[loss=0.2271, simple_loss=0.3084, pruned_loss=0.07297, over 4124257.53 frames. ], batch size: 211, lr: 4.48e-03, grad_scale: 8.0 2023-06-24 12:40:48,184 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.26 vs. limit=15.0 2023-06-24 12:41:10,157 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1101822.0, ans=0.125 2023-06-24 12:41:27,734 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1101882.0, ans=0.125 2023-06-24 12:41:29,796 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1101882.0, ans=0.125 2023-06-24 12:41:51,779 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.212e+02 2.716e+02 3.087e+02 3.645e+02 5.920e+02, threshold=6.175e+02, percent-clipped=0.0 2023-06-24 12:41:54,320 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1101942.0, ans=0.0 2023-06-24 12:42:01,012 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-24 12:42:05,932 INFO [train.py:996] (3/4) Epoch 7, batch 700, loss[loss=0.2543, simple_loss=0.3302, pruned_loss=0.08919, over 21806.00 frames. ], tot_loss[loss=0.2284, simple_loss=0.3092, pruned_loss=0.07383, over 4151245.32 frames. ], batch size: 112, lr: 4.48e-03, grad_scale: 8.0 2023-06-24 12:42:06,663 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1102002.0, ans=0.0 2023-06-24 12:43:16,988 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.20 vs. limit=12.0 2023-06-24 12:43:19,769 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1102182.0, ans=0.125 2023-06-24 12:43:19,825 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=1102182.0, ans=0.0 2023-06-24 12:43:30,586 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=1102242.0, ans=0.5 2023-06-24 12:43:59,372 INFO [train.py:996] (3/4) Epoch 7, batch 750, loss[loss=0.2398, simple_loss=0.3619, pruned_loss=0.05888, over 19782.00 frames. ], tot_loss[loss=0.2294, simple_loss=0.3093, pruned_loss=0.07477, over 4180787.20 frames. ], batch size: 702, lr: 4.48e-03, grad_scale: 8.0 2023-06-24 12:44:37,422 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=12.50 vs. limit=15.0 2023-06-24 12:44:37,475 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.67 vs. 
limit=22.5 2023-06-24 12:44:57,622 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1102422.0, ans=0.125 2023-06-24 12:45:28,792 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.965e+02 2.942e+02 3.385e+02 4.235e+02 7.679e+02, threshold=6.771e+02, percent-clipped=3.0 2023-06-24 12:45:42,986 INFO [train.py:996] (3/4) Epoch 7, batch 800, loss[loss=0.2139, simple_loss=0.2906, pruned_loss=0.0686, over 21314.00 frames. ], tot_loss[loss=0.2291, simple_loss=0.3072, pruned_loss=0.0755, over 4209308.79 frames. ], batch size: 131, lr: 4.48e-03, grad_scale: 16.0 2023-06-24 12:45:57,239 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer_ff2.min_abs, batch_count=1102602.0, ans=0.1 2023-06-24 12:46:03,543 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.41 vs. limit=6.0 2023-06-24 12:46:19,159 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=1102662.0, ans=0.125 2023-06-24 12:46:33,328 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1102722.0, ans=0.0 2023-06-24 12:46:36,687 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1102722.0, ans=0.125 2023-06-24 12:46:51,089 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.02 vs. limit=12.0 2023-06-24 12:47:03,218 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1102782.0, ans=0.125 2023-06-24 12:47:38,980 INFO [train.py:996] (3/4) Epoch 7, batch 850, loss[loss=0.1972, simple_loss=0.2562, pruned_loss=0.06906, over 21533.00 frames. ], tot_loss[loss=0.2279, simple_loss=0.305, pruned_loss=0.07545, over 4229851.23 frames. ], batch size: 263, lr: 4.47e-03, grad_scale: 16.0 2023-06-24 12:48:04,186 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1102962.0, ans=0.125 2023-06-24 12:48:06,031 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=1102962.0, ans=0.2 2023-06-24 12:49:04,667 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=1103142.0, ans=0.2 2023-06-24 12:49:07,522 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.999e+02 2.715e+02 3.192e+02 3.563e+02 7.547e+02, threshold=6.383e+02, percent-clipped=1.0 2023-06-24 12:49:17,041 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=1103142.0, ans=0.2 2023-06-24 12:49:27,447 INFO [train.py:996] (3/4) Epoch 7, batch 900, loss[loss=0.1973, simple_loss=0.2779, pruned_loss=0.05832, over 21164.00 frames. ], tot_loss[loss=0.2229, simple_loss=0.2986, pruned_loss=0.07364, over 4234873.52 frames. 
], batch size: 548, lr: 4.47e-03, grad_scale: 16.0 2023-06-24 12:51:11,215 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=1103442.0, ans=0.0 2023-06-24 12:51:17,569 INFO [train.py:996] (3/4) Epoch 7, batch 950, loss[loss=0.2274, simple_loss=0.2854, pruned_loss=0.08468, over 21660.00 frames. ], tot_loss[loss=0.222, simple_loss=0.2973, pruned_loss=0.07333, over 4248435.03 frames. ], batch size: 333, lr: 4.47e-03, grad_scale: 16.0 2023-06-24 12:52:06,152 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1103622.0, ans=0.2 2023-06-24 12:52:08,117 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1103622.0, ans=0.125 2023-06-24 12:52:31,199 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.49 vs. limit=22.5 2023-06-24 12:52:44,296 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.71 vs. limit=15.0 2023-06-24 12:52:59,202 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.008e+02 2.594e+02 2.897e+02 3.337e+02 7.292e+02, threshold=5.794e+02, percent-clipped=1.0 2023-06-24 12:53:07,710 INFO [train.py:996] (3/4) Epoch 7, batch 1000, loss[loss=0.1677, simple_loss=0.2516, pruned_loss=0.04188, over 21340.00 frames. ], tot_loss[loss=0.2243, simple_loss=0.2994, pruned_loss=0.07458, over 4257394.30 frames. ], batch size: 194, lr: 4.47e-03, grad_scale: 16.0 2023-06-24 12:53:18,249 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-24 12:54:00,242 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1103922.0, ans=0.125 2023-06-24 12:54:09,469 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=1103922.0, ans=0.125 2023-06-24 12:55:12,144 INFO [train.py:996] (3/4) Epoch 7, batch 1050, loss[loss=0.2241, simple_loss=0.2888, pruned_loss=0.07973, over 21313.00 frames. ], tot_loss[loss=0.2245, simple_loss=0.2983, pruned_loss=0.0754, over 4267516.25 frames. ], batch size: 176, lr: 4.47e-03, grad_scale: 16.0 2023-06-24 12:55:41,225 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1104162.0, ans=0.125 2023-06-24 12:55:49,601 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=1104222.0, ans=0.05 2023-06-24 12:56:39,835 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1104342.0, ans=0.1 2023-06-24 12:56:42,635 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.988e+02 2.809e+02 3.239e+02 3.685e+02 6.477e+02, threshold=6.478e+02, percent-clipped=3.0 2023-06-24 12:56:57,478 INFO [train.py:996] (3/4) Epoch 7, batch 1100, loss[loss=0.2241, simple_loss=0.3075, pruned_loss=0.07041, over 21744.00 frames. ], tot_loss[loss=0.2224, simple_loss=0.2964, pruned_loss=0.07418, over 4269074.76 frames. 
], batch size: 414, lr: 4.47e-03, grad_scale: 16.0 2023-06-24 12:57:19,068 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=1104462.0, ans=0.2 2023-06-24 12:57:28,751 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=12.27 vs. limit=15.0 2023-06-24 12:57:42,873 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.43 vs. limit=6.0 2023-06-24 12:58:00,890 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=1104582.0, ans=0.0 2023-06-24 12:58:14,610 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1104582.0, ans=0.125 2023-06-24 12:58:33,101 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1104642.0, ans=0.1 2023-06-24 12:58:48,070 INFO [train.py:996] (3/4) Epoch 7, batch 1150, loss[loss=0.2361, simple_loss=0.3122, pruned_loss=0.07998, over 21573.00 frames. ], tot_loss[loss=0.2208, simple_loss=0.2954, pruned_loss=0.07311, over 4269430.19 frames. ], batch size: 441, lr: 4.47e-03, grad_scale: 16.0 2023-06-24 12:59:12,028 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1104762.0, ans=0.125 2023-06-24 12:59:42,045 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.73 vs. limit=10.0 2023-06-24 12:59:55,652 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=1104882.0, ans=0.0 2023-06-24 13:00:13,390 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1104882.0, ans=0.125 2023-06-24 13:00:27,397 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=1104942.0, ans=0.2 2023-06-24 13:00:30,288 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.093e+02 2.493e+02 2.841e+02 3.361e+02 6.236e+02, threshold=5.682e+02, percent-clipped=0.0 2023-06-24 13:00:38,720 INFO [train.py:996] (3/4) Epoch 7, batch 1200, loss[loss=0.2225, simple_loss=0.3012, pruned_loss=0.07189, over 21768.00 frames. ], tot_loss[loss=0.2228, simple_loss=0.2972, pruned_loss=0.07424, over 4272863.10 frames. 
], batch size: 247, lr: 4.47e-03, grad_scale: 32.0 2023-06-24 13:00:44,704 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1105002.0, ans=0.1 2023-06-24 13:00:48,040 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=1105002.0, ans=0.2 2023-06-24 13:00:51,281 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=1105002.0, ans=0.0 2023-06-24 13:01:03,646 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1105062.0, ans=0.125 2023-06-24 13:02:00,659 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1105182.0, ans=0.125 2023-06-24 13:02:28,499 INFO [train.py:996] (3/4) Epoch 7, batch 1250, loss[loss=0.2091, simple_loss=0.2874, pruned_loss=0.06537, over 21835.00 frames. ], tot_loss[loss=0.2245, simple_loss=0.2988, pruned_loss=0.07514, over 4276049.62 frames. ], batch size: 351, lr: 4.47e-03, grad_scale: 32.0 2023-06-24 13:03:12,251 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.69 vs. limit=15.0 2023-06-24 13:03:27,770 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1105422.0, ans=0.0 2023-06-24 13:03:59,411 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1105542.0, ans=0.125 2023-06-24 13:04:09,144 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.126e+02 2.694e+02 3.114e+02 3.849e+02 5.488e+02, threshold=6.227e+02, percent-clipped=0.0 2023-06-24 13:04:18,057 INFO [train.py:996] (3/4) Epoch 7, batch 1300, loss[loss=0.2349, simple_loss=0.303, pruned_loss=0.08343, over 21928.00 frames. ], tot_loss[loss=0.2253, simple_loss=0.3005, pruned_loss=0.07508, over 4280126.34 frames. ], batch size: 113, lr: 4.47e-03, grad_scale: 32.0 2023-06-24 13:04:39,466 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=1105662.0, ans=0.0 2023-06-24 13:04:43,056 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=1105662.0, ans=0.0 2023-06-24 13:06:06,814 INFO [train.py:996] (3/4) Epoch 7, batch 1350, loss[loss=0.2218, simple_loss=0.2889, pruned_loss=0.07737, over 21817.00 frames. ], tot_loss[loss=0.2274, simple_loss=0.3023, pruned_loss=0.07624, over 4282603.61 frames. ], batch size: 414, lr: 4.47e-03, grad_scale: 32.0 2023-06-24 13:07:31,140 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=1106082.0, ans=0.0 2023-06-24 13:07:40,055 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1106142.0, ans=0.125 2023-06-24 13:07:48,034 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.180e+02 2.498e+02 2.809e+02 3.151e+02 4.941e+02, threshold=5.617e+02, percent-clipped=0.0 2023-06-24 13:07:56,337 INFO [train.py:996] (3/4) Epoch 7, batch 1400, loss[loss=0.2148, simple_loss=0.2871, pruned_loss=0.07128, over 21651.00 frames. 
], tot_loss[loss=0.2259, simple_loss=0.3004, pruned_loss=0.07566, over 4281729.73 frames. ], batch size: 247, lr: 4.47e-03, grad_scale: 32.0 2023-06-24 13:08:01,115 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.91 vs. limit=10.0 2023-06-24 13:08:23,630 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=6.79 vs. limit=15.0 2023-06-24 13:08:42,109 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=1106322.0, ans=0.2 2023-06-24 13:09:16,449 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1106382.0, ans=0.125 2023-06-24 13:09:30,476 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1106442.0, ans=0.125 2023-06-24 13:09:46,115 INFO [train.py:996] (3/4) Epoch 7, batch 1450, loss[loss=0.256, simple_loss=0.3358, pruned_loss=0.08806, over 21372.00 frames. ], tot_loss[loss=0.227, simple_loss=0.3014, pruned_loss=0.07635, over 4275154.01 frames. ], batch size: 131, lr: 4.47e-03, grad_scale: 16.0 2023-06-24 13:10:07,530 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=1106562.0, ans=0.07 2023-06-24 13:10:12,594 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=1106562.0, ans=0.125 2023-06-24 13:10:25,538 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1106622.0, ans=0.1 2023-06-24 13:10:48,470 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1106682.0, ans=0.125 2023-06-24 13:11:22,506 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1106742.0, ans=0.1 2023-06-24 13:11:28,956 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.014e+02 2.734e+02 3.228e+02 3.700e+02 6.613e+02, threshold=6.455e+02, percent-clipped=4.0 2023-06-24 13:11:33,342 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1106742.0, ans=0.125 2023-06-24 13:11:36,320 INFO [train.py:996] (3/4) Epoch 7, batch 1500, loss[loss=0.2116, simple_loss=0.2938, pruned_loss=0.06473, over 17582.00 frames. ], tot_loss[loss=0.2287, simple_loss=0.3035, pruned_loss=0.07693, over 4275478.08 frames. ], batch size: 60, lr: 4.47e-03, grad_scale: 16.0 2023-06-24 13:11:47,046 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1106802.0, ans=0.1 2023-06-24 13:11:57,534 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=1106862.0, ans=0.015 2023-06-24 13:13:02,970 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=1106982.0, ans=0.035 2023-06-24 13:13:24,204 INFO [train.py:996] (3/4) Epoch 7, batch 1550, loss[loss=0.2421, simple_loss=0.3125, pruned_loss=0.0859, over 21501.00 frames. ], tot_loss[loss=0.2256, simple_loss=0.3007, pruned_loss=0.0752, over 4276480.70 frames. 
], batch size: 548, lr: 4.47e-03, grad_scale: 16.0 2023-06-24 13:14:45,730 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1107282.0, ans=0.0 2023-06-24 13:14:51,121 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1107282.0, ans=0.0 2023-06-24 13:14:55,136 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.81 vs. limit=10.0 2023-06-24 13:15:06,850 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.150e+02 2.619e+02 3.008e+02 3.656e+02 5.850e+02, threshold=6.017e+02, percent-clipped=0.0 2023-06-24 13:15:09,115 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=1107342.0, ans=0.2 2023-06-24 13:15:13,450 INFO [train.py:996] (3/4) Epoch 7, batch 1600, loss[loss=0.2807, simple_loss=0.3317, pruned_loss=0.1149, over 21340.00 frames. ], tot_loss[loss=0.2236, simple_loss=0.2988, pruned_loss=0.07421, over 4277013.43 frames. ], batch size: 507, lr: 4.47e-03, grad_scale: 32.0 2023-06-24 13:15:16,007 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=1107402.0, ans=0.04949747468305833 2023-06-24 13:15:37,764 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=1107462.0, ans=0.0 2023-06-24 13:16:34,517 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1107582.0, ans=0.125 2023-06-24 13:16:35,252 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=4.61 vs. limit=15.0 2023-06-24 13:17:11,097 INFO [train.py:996] (3/4) Epoch 7, batch 1650, loss[loss=0.2129, simple_loss=0.2834, pruned_loss=0.07118, over 21159.00 frames. ], tot_loss[loss=0.2234, simple_loss=0.2996, pruned_loss=0.07367, over 4268972.02 frames. ], batch size: 608, lr: 4.46e-03, grad_scale: 32.0 2023-06-24 13:17:23,957 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-24 13:18:11,781 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=1107822.0, ans=0.125 2023-06-24 13:18:55,908 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.937e+02 2.769e+02 3.129e+02 3.705e+02 6.024e+02, threshold=6.259e+02, percent-clipped=1.0 2023-06-24 13:19:03,699 INFO [train.py:996] (3/4) Epoch 7, batch 1700, loss[loss=0.2326, simple_loss=0.3285, pruned_loss=0.06837, over 21856.00 frames. ], tot_loss[loss=0.2251, simple_loss=0.3012, pruned_loss=0.07453, over 4270578.61 frames. ], batch size: 316, lr: 4.46e-03, grad_scale: 32.0 2023-06-24 13:19:48,544 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=1108062.0, ans=0.125 2023-06-24 13:20:19,511 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1108182.0, ans=0.1 2023-06-24 13:20:26,729 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=1108182.0, ans=0.2 2023-06-24 13:21:02,720 INFO [train.py:996] (3/4) Epoch 7, batch 1750, loss[loss=0.1765, simple_loss=0.2506, pruned_loss=0.05123, over 21264.00 frames. 
], tot_loss[loss=0.2231, simple_loss=0.2997, pruned_loss=0.07323, over 4262940.71 frames. ], batch size: 159, lr: 4.46e-03, grad_scale: 16.0 2023-06-24 13:21:16,851 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.76 vs. limit=15.0 2023-06-24 13:22:01,535 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=1108422.0, ans=0.0 2023-06-24 13:22:14,800 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1108482.0, ans=0.1 2023-06-24 13:22:56,563 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.880e+02 2.765e+02 3.316e+02 4.331e+02 7.357e+02, threshold=6.632e+02, percent-clipped=3.0 2023-06-24 13:23:07,037 INFO [train.py:996] (3/4) Epoch 7, batch 1800, loss[loss=0.2374, simple_loss=0.3249, pruned_loss=0.07493, over 21748.00 frames. ], tot_loss[loss=0.2203, simple_loss=0.2988, pruned_loss=0.07089, over 4270292.63 frames. ], batch size: 351, lr: 4.46e-03, grad_scale: 16.0 2023-06-24 13:23:35,510 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=1108662.0, ans=0.2 2023-06-24 13:24:25,466 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=1108782.0, ans=0.04949747468305833 2023-06-24 13:24:38,644 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1108842.0, ans=0.125 2023-06-24 13:24:52,484 INFO [train.py:996] (3/4) Epoch 7, batch 1850, loss[loss=0.2277, simple_loss=0.3091, pruned_loss=0.07314, over 21841.00 frames. ], tot_loss[loss=0.2185, simple_loss=0.2982, pruned_loss=0.06939, over 4276110.86 frames. ], batch size: 371, lr: 4.46e-03, grad_scale: 8.0 2023-06-24 13:25:13,992 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1108962.0, ans=0.125 2023-06-24 13:25:49,948 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1109022.0, ans=0.125 2023-06-24 13:26:01,942 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=1109082.0, ans=0.04949747468305833 2023-06-24 13:26:04,358 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.28 vs. limit=15.0 2023-06-24 13:26:30,844 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.max_abs, batch_count=1109142.0, ans=10.0 2023-06-24 13:26:38,964 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.824e+02 2.879e+02 3.449e+02 4.316e+02 7.592e+02, threshold=6.898e+02, percent-clipped=3.0 2023-06-24 13:26:47,942 INFO [train.py:996] (3/4) Epoch 7, batch 1900, loss[loss=0.2806, simple_loss=0.3411, pruned_loss=0.1101, over 21637.00 frames. ], tot_loss[loss=0.2189, simple_loss=0.2987, pruned_loss=0.0695, over 4284325.57 frames. ], batch size: 507, lr: 4.46e-03, grad_scale: 8.0 2023-06-24 13:27:27,005 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.52 vs. 
limit=6.0 2023-06-24 13:28:38,152 INFO [train.py:996] (3/4) Epoch 7, batch 1950, loss[loss=0.2215, simple_loss=0.294, pruned_loss=0.0745, over 21532.00 frames. ], tot_loss[loss=0.217, simple_loss=0.2953, pruned_loss=0.06937, over 4289498.07 frames. ], batch size: 212, lr: 4.46e-03, grad_scale: 8.0 2023-06-24 13:28:39,618 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=2.74 vs. limit=15.0 2023-06-24 13:28:47,503 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1109502.0, ans=0.125 2023-06-24 13:28:56,248 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=1109562.0, ans=0.125 2023-06-24 13:29:28,748 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=1109622.0, ans=0.125 2023-06-24 13:30:12,670 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1109742.0, ans=0.0 2023-06-24 13:30:12,748 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=1109742.0, ans=0.125 2023-06-24 13:30:26,483 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.057e+02 2.664e+02 3.137e+02 3.840e+02 6.499e+02, threshold=6.275e+02, percent-clipped=0.0 2023-06-24 13:30:29,923 INFO [train.py:996] (3/4) Epoch 7, batch 2000, loss[loss=0.285, simple_loss=0.386, pruned_loss=0.09204, over 21183.00 frames. ], tot_loss[loss=0.214, simple_loss=0.2918, pruned_loss=0.06811, over 4281151.56 frames. ], batch size: 548, lr: 4.46e-03, grad_scale: 16.0 2023-06-24 13:31:13,287 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1109922.0, ans=0.125 2023-06-24 13:32:03,674 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=1110042.0, ans=0.0 2023-06-24 13:32:20,834 INFO [train.py:996] (3/4) Epoch 7, batch 2050, loss[loss=0.2158, simple_loss=0.2936, pruned_loss=0.06904, over 21656.00 frames. ], tot_loss[loss=0.2169, simple_loss=0.2945, pruned_loss=0.06967, over 4286191.98 frames. ], batch size: 263, lr: 4.46e-03, grad_scale: 16.0 2023-06-24 13:32:40,950 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.98 vs. limit=15.0 2023-06-24 13:32:50,900 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1110162.0, ans=0.1 2023-06-24 13:34:07,172 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.959e+02 2.697e+02 3.083e+02 3.787e+02 7.892e+02, threshold=6.165e+02, percent-clipped=1.0 2023-06-24 13:34:10,760 INFO [train.py:996] (3/4) Epoch 7, batch 2100, loss[loss=0.2347, simple_loss=0.3018, pruned_loss=0.08378, over 21600.00 frames. ], tot_loss[loss=0.2204, simple_loss=0.2978, pruned_loss=0.07153, over 4282519.19 frames. 
], batch size: 414, lr: 4.46e-03, grad_scale: 16.0 2023-06-24 13:34:37,695 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1110462.0, ans=0.1 2023-06-24 13:34:52,355 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1110522.0, ans=0.125 2023-06-24 13:35:27,535 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=1110582.0, ans=0.0 2023-06-24 13:35:38,055 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1110642.0, ans=0.1 2023-06-24 13:35:40,192 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1110642.0, ans=0.125 2023-06-24 13:36:00,132 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.99 vs. limit=15.0 2023-06-24 13:36:02,102 INFO [train.py:996] (3/4) Epoch 7, batch 2150, loss[loss=0.2652, simple_loss=0.3009, pruned_loss=0.1147, over 21474.00 frames. ], tot_loss[loss=0.2235, simple_loss=0.2993, pruned_loss=0.07379, over 4288430.26 frames. ], batch size: 511, lr: 4.46e-03, grad_scale: 16.0 2023-06-24 13:36:02,892 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-24 13:37:27,315 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=1110882.0, ans=0.0 2023-06-24 13:37:44,805 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1110942.0, ans=0.125 2023-06-24 13:37:49,019 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.267e+02 2.806e+02 3.490e+02 4.529e+02 7.299e+02, threshold=6.981e+02, percent-clipped=4.0 2023-06-24 13:37:52,588 INFO [train.py:996] (3/4) Epoch 7, batch 2200, loss[loss=0.1952, simple_loss=0.2805, pruned_loss=0.05492, over 21723.00 frames. ], tot_loss[loss=0.2256, simple_loss=0.3041, pruned_loss=0.07354, over 4285688.67 frames. ], batch size: 247, lr: 4.46e-03, grad_scale: 16.0 2023-06-24 13:38:05,289 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=1111002.0, ans=0.5 2023-06-24 13:39:14,249 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=3.25 vs. limit=12.0 2023-06-24 13:39:18,608 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1111242.0, ans=0.0 2023-06-24 13:39:40,169 INFO [train.py:996] (3/4) Epoch 7, batch 2250, loss[loss=0.2015, simple_loss=0.2848, pruned_loss=0.05913, over 21432.00 frames. ], tot_loss[loss=0.222, simple_loss=0.2998, pruned_loss=0.07203, over 4278947.91 frames. 
], batch size: 211, lr: 4.46e-03, grad_scale: 16.0 2023-06-24 13:39:47,297 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=1111302.0, ans=0.2 2023-06-24 13:40:01,281 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1111362.0, ans=0.125 2023-06-24 13:40:52,989 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1111482.0, ans=0.125 2023-06-24 13:40:59,900 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1111482.0, ans=0.1 2023-06-24 13:41:24,939 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.962e+02 2.731e+02 3.125e+02 3.958e+02 6.138e+02, threshold=6.249e+02, percent-clipped=0.0 2023-06-24 13:41:27,031 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=1111602.0, ans=0.125 2023-06-24 13:41:28,581 INFO [train.py:996] (3/4) Epoch 7, batch 2300, loss[loss=0.2075, simple_loss=0.2697, pruned_loss=0.07268, over 21764.00 frames. ], tot_loss[loss=0.2187, simple_loss=0.2952, pruned_loss=0.07114, over 4279382.02 frames. ], batch size: 112, lr: 4.46e-03, grad_scale: 16.0 2023-06-24 13:41:40,355 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=6.68 vs. limit=15.0 2023-06-24 13:41:43,108 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=1111602.0, ans=0.0 2023-06-24 13:43:10,335 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.33 vs. limit=22.5 2023-06-24 13:43:15,022 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=1111842.0, ans=0.125 2023-06-24 13:43:17,668 INFO [train.py:996] (3/4) Epoch 7, batch 2350, loss[loss=0.2206, simple_loss=0.2957, pruned_loss=0.07275, over 21283.00 frames. ], tot_loss[loss=0.2192, simple_loss=0.2937, pruned_loss=0.07233, over 4276807.35 frames. ], batch size: 159, lr: 4.46e-03, grad_scale: 16.0 2023-06-24 13:44:01,017 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=1112022.0, ans=0.2 2023-06-24 13:44:01,140 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1112022.0, ans=0.1 2023-06-24 13:44:01,576 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.55 vs. 
limit=15.0 2023-06-24 13:44:17,833 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=1112022.0, ans=10.0 2023-06-24 13:44:42,919 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1112082.0, ans=0.0 2023-06-24 13:44:44,553 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-24 13:44:46,319 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1112142.0, ans=0.0 2023-06-24 13:45:05,380 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.026e+02 2.752e+02 3.198e+02 3.763e+02 6.793e+02, threshold=6.396e+02, percent-clipped=2.0 2023-06-24 13:45:08,860 INFO [train.py:996] (3/4) Epoch 7, batch 2400, loss[loss=0.2008, simple_loss=0.2896, pruned_loss=0.05604, over 21683.00 frames. ], tot_loss[loss=0.2222, simple_loss=0.2965, pruned_loss=0.07397, over 4283148.47 frames. ], batch size: 263, lr: 4.46e-03, grad_scale: 32.0 2023-06-24 13:45:17,987 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=12.14 vs. limit=22.5 2023-06-24 13:45:44,122 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1112262.0, ans=0.0 2023-06-24 13:45:46,113 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.84 vs. limit=15.0 2023-06-24 13:45:53,726 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1112262.0, ans=0.0 2023-06-24 13:46:59,003 INFO [train.py:996] (3/4) Epoch 7, batch 2450, loss[loss=0.2247, simple_loss=0.2934, pruned_loss=0.07799, over 15757.00 frames. ], tot_loss[loss=0.2254, simple_loss=0.3002, pruned_loss=0.07525, over 4272604.24 frames. ], batch size: 62, lr: 4.46e-03, grad_scale: 32.0 2023-06-24 13:47:06,729 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1112502.0, ans=0.125 2023-06-24 13:47:10,050 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1112502.0, ans=0.1 2023-06-24 13:47:11,385 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=1112502.0, ans=0.09899494936611666 2023-06-24 13:47:15,169 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=1112562.0, ans=0.125 2023-06-24 13:47:19,069 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=1112562.0, ans=0.0 2023-06-24 13:47:38,750 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=8.83 vs. 
limit=22.5 2023-06-24 13:48:13,857 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1112682.0, ans=0.1 2023-06-24 13:48:40,042 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1112742.0, ans=0.125 2023-06-24 13:48:46,854 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-24 13:48:48,358 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.212e+02 2.774e+02 3.648e+02 4.607e+02 7.858e+02, threshold=7.296e+02, percent-clipped=5.0 2023-06-24 13:48:50,616 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=1112802.0, ans=0.125 2023-06-24 13:48:51,789 INFO [train.py:996] (3/4) Epoch 7, batch 2500, loss[loss=0.2304, simple_loss=0.3182, pruned_loss=0.07132, over 21579.00 frames. ], tot_loss[loss=0.2226, simple_loss=0.2971, pruned_loss=0.074, over 4275399.08 frames. ], batch size: 414, lr: 4.45e-03, grad_scale: 32.0 2023-06-24 13:49:27,555 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=1112862.0, ans=0.0 2023-06-24 13:49:53,694 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.80 vs. limit=15.0 2023-06-24 13:50:00,137 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=1112982.0, ans=0.2 2023-06-24 13:50:42,259 INFO [train.py:996] (3/4) Epoch 7, batch 2550, loss[loss=0.2074, simple_loss=0.289, pruned_loss=0.06289, over 21330.00 frames. ], tot_loss[loss=0.2216, simple_loss=0.2969, pruned_loss=0.07315, over 4259160.12 frames. ], batch size: 144, lr: 4.45e-03, grad_scale: 32.0 2023-06-24 13:52:06,237 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-24 13:52:22,883 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.97 vs. limit=15.0 2023-06-24 13:52:26,114 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1113342.0, ans=0.125 2023-06-24 13:52:30,405 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.150e+02 2.860e+02 3.358e+02 4.176e+02 6.278e+02, threshold=6.716e+02, percent-clipped=0.0 2023-06-24 13:52:32,019 INFO [train.py:996] (3/4) Epoch 7, batch 2600, loss[loss=0.2585, simple_loss=0.3346, pruned_loss=0.09119, over 21814.00 frames. ], tot_loss[loss=0.2241, simple_loss=0.2992, pruned_loss=0.07448, over 4260196.30 frames. ], batch size: 124, lr: 4.45e-03, grad_scale: 16.0 2023-06-24 13:52:41,878 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-24 13:53:01,891 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=1113462.0, ans=0.125 2023-06-24 13:53:21,727 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1113522.0, ans=0.1 2023-06-24 13:53:46,988 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.89 vs. 
limit=22.5 2023-06-24 13:54:07,161 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=1113642.0, ans=0.0 2023-06-24 13:54:23,109 INFO [train.py:996] (3/4) Epoch 7, batch 2650, loss[loss=0.1943, simple_loss=0.2492, pruned_loss=0.06971, over 21183.00 frames. ], tot_loss[loss=0.224, simple_loss=0.2988, pruned_loss=0.0746, over 4266402.40 frames. ], batch size: 548, lr: 4.45e-03, grad_scale: 16.0 2023-06-24 13:55:27,927 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1113822.0, ans=0.1 2023-06-24 13:55:54,138 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1113882.0, ans=0.1 2023-06-24 13:56:12,124 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.196e+02 2.624e+02 3.107e+02 3.655e+02 6.528e+02, threshold=6.215e+02, percent-clipped=0.0 2023-06-24 13:56:14,281 INFO [train.py:996] (3/4) Epoch 7, batch 2700, loss[loss=0.2302, simple_loss=0.3227, pruned_loss=0.06883, over 21338.00 frames. ], tot_loss[loss=0.2221, simple_loss=0.2964, pruned_loss=0.07392, over 4271855.74 frames. ], batch size: 548, lr: 4.45e-03, grad_scale: 16.0 2023-06-24 13:56:22,397 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.45 vs. limit=12.0 2023-06-24 13:56:57,941 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=1114062.0, ans=0.125 2023-06-24 13:57:28,496 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=1114182.0, ans=0.2 2023-06-24 13:58:04,697 INFO [train.py:996] (3/4) Epoch 7, batch 2750, loss[loss=0.2374, simple_loss=0.3258, pruned_loss=0.0745, over 21835.00 frames. ], tot_loss[loss=0.2232, simple_loss=0.2971, pruned_loss=0.0746, over 4273301.13 frames. ], batch size: 371, lr: 4.45e-03, grad_scale: 16.0 2023-06-24 13:58:46,531 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=1114362.0, ans=0.0 2023-06-24 13:59:11,206 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.44 vs. limit=15.0 2023-06-24 13:59:23,051 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=1114482.0, ans=0.5 2023-06-24 13:59:54,940 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=1114542.0, ans=0.0 2023-06-24 14:00:01,267 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.337e+02 2.964e+02 3.229e+02 3.808e+02 6.340e+02, threshold=6.458e+02, percent-clipped=1.0 2023-06-24 14:00:03,064 INFO [train.py:996] (3/4) Epoch 7, batch 2800, loss[loss=0.2534, simple_loss=0.3251, pruned_loss=0.09082, over 21609.00 frames. ], tot_loss[loss=0.2269, simple_loss=0.3022, pruned_loss=0.07583, over 4269563.54 frames. ], batch size: 441, lr: 4.45e-03, grad_scale: 32.0 2023-06-24 14:00:10,094 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.81 vs. 
limit=12.0 2023-06-24 14:00:30,848 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=1114662.0, ans=0.5 2023-06-24 14:00:49,867 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.09 vs. limit=22.5 2023-06-24 14:00:54,744 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=1114722.0, ans=0.0 2023-06-24 14:01:13,939 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-24 14:01:17,572 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=1114782.0, ans=0.0 2023-06-24 14:01:54,112 INFO [train.py:996] (3/4) Epoch 7, batch 2850, loss[loss=0.2293, simple_loss=0.3069, pruned_loss=0.07582, over 21738.00 frames. ], tot_loss[loss=0.2291, simple_loss=0.3036, pruned_loss=0.07724, over 4275326.71 frames. ], batch size: 391, lr: 4.45e-03, grad_scale: 32.0 2023-06-24 14:02:07,788 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=5.86 vs. limit=15.0 2023-06-24 14:02:19,389 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=1114962.0, ans=0.125 2023-06-24 14:02:38,776 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.max_abs, batch_count=1115022.0, ans=10.0 2023-06-24 14:02:47,540 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=1115022.0, ans=0.0 2023-06-24 14:03:40,075 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1115142.0, ans=0.1 2023-06-24 14:03:41,700 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1115202.0, ans=0.125 2023-06-24 14:03:42,914 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.437e+02 2.854e+02 3.316e+02 3.985e+02 8.556e+02, threshold=6.632e+02, percent-clipped=4.0 2023-06-24 14:03:42,945 INFO [train.py:996] (3/4) Epoch 7, batch 2900, loss[loss=0.2111, simple_loss=0.2923, pruned_loss=0.06495, over 21723.00 frames. ], tot_loss[loss=0.2276, simple_loss=0.3011, pruned_loss=0.077, over 4280691.53 frames. 
], batch size: 298, lr: 4.45e-03, grad_scale: 16.0 2023-06-24 14:03:59,423 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=1115202.0, ans=0.2 2023-06-24 14:04:29,873 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1115322.0, ans=0.0 2023-06-24 14:04:35,122 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1115322.0, ans=0.0 2023-06-24 14:04:40,774 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1115322.0, ans=0.125 2023-06-24 14:04:51,108 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=1115382.0, ans=0.0 2023-06-24 14:04:53,124 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1115382.0, ans=0.125 2023-06-24 14:04:54,679 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1115382.0, ans=0.125 2023-06-24 14:04:56,688 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1115382.0, ans=0.125 2023-06-24 14:05:01,610 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1115382.0, ans=0.0 2023-06-24 14:05:20,963 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.12 vs. limit=15.0 2023-06-24 14:05:33,560 INFO [train.py:996] (3/4) Epoch 7, batch 2950, loss[loss=0.2397, simple_loss=0.313, pruned_loss=0.08316, over 19942.00 frames. ], tot_loss[loss=0.2287, simple_loss=0.303, pruned_loss=0.07721, over 4283313.73 frames. ], batch size: 702, lr: 4.45e-03, grad_scale: 16.0 2023-06-24 14:05:46,845 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=1115502.0, ans=0.125 2023-06-24 14:05:48,715 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-24 14:06:34,929 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=1115622.0, ans=0.0 2023-06-24 14:06:38,429 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=1115622.0, ans=0.0 2023-06-24 14:06:56,737 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=1115682.0, ans=0.07 2023-06-24 14:07:15,069 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=1115742.0, ans=0.05 2023-06-24 14:07:24,640 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.073e+02 2.857e+02 3.209e+02 3.929e+02 8.381e+02, threshold=6.419e+02, percent-clipped=2.0 2023-06-24 14:07:24,676 INFO [train.py:996] (3/4) Epoch 7, batch 3000, loss[loss=0.2307, simple_loss=0.3393, pruned_loss=0.0611, over 20747.00 frames. ], tot_loss[loss=0.2302, simple_loss=0.306, pruned_loss=0.07719, over 4289687.68 frames. 
], batch size: 607, lr: 4.45e-03, grad_scale: 16.0 2023-06-24 14:07:24,676 INFO [train.py:1019] (3/4) Computing validation loss 2023-06-24 14:07:34,593 INFO [zipformer.py:1728] (3/4) name=encoder.encoders.2.encoder.layers.1.self_attn_weights, attn_weights_entropy = tensor([1.7978, 3.9279, 3.7261, 3.9515], device='cuda:3') 2023-06-24 14:07:46,557 INFO [train.py:1028] (3/4) Epoch 7, validation: loss=0.2481, simple_loss=0.3407, pruned_loss=0.0778, over 1796401.00 frames. 2023-06-24 14:07:46,558 INFO [train.py:1029] (3/4) Maximum memory allocated so far is 23409MB 2023-06-24 14:08:04,593 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=9.21 vs. limit=10.0 2023-06-24 14:08:27,354 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1115862.0, ans=0.125 2023-06-24 14:08:28,863 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=1115862.0, ans=0.07 2023-06-24 14:08:30,907 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=1115862.0, ans=0.125 2023-06-24 14:08:32,662 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-24 14:08:45,436 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.45 vs. limit=15.0 2023-06-24 14:09:25,618 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1116042.0, ans=0.125 2023-06-24 14:09:25,631 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=1116042.0, ans=0.125 2023-06-24 14:09:37,529 INFO [train.py:996] (3/4) Epoch 7, batch 3050, loss[loss=0.2218, simple_loss=0.297, pruned_loss=0.07334, over 21399.00 frames. ], tot_loss[loss=0.2296, simple_loss=0.307, pruned_loss=0.07613, over 4292293.40 frames. ], batch size: 176, lr: 4.45e-03, grad_scale: 16.0 2023-06-24 14:09:47,720 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.15 vs. limit=12.0 2023-06-24 14:10:47,949 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1116282.0, ans=0.125 2023-06-24 14:11:14,863 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=1116342.0, ans=0.125 2023-06-24 14:11:33,742 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.863e+02 2.533e+02 2.915e+02 3.819e+02 6.639e+02, threshold=5.830e+02, percent-clipped=1.0 2023-06-24 14:11:33,772 INFO [train.py:996] (3/4) Epoch 7, batch 3100, loss[loss=0.261, simple_loss=0.3475, pruned_loss=0.08723, over 21475.00 frames. ], tot_loss[loss=0.2271, simple_loss=0.3044, pruned_loss=0.07494, over 4289798.58 frames. ], batch size: 471, lr: 4.45e-03, grad_scale: 16.0 2023-06-24 14:11:41,919 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.72 vs. 
limit=22.5 2023-06-24 14:12:47,772 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=1116582.0, ans=0.2 2023-06-24 14:13:25,679 INFO [train.py:996] (3/4) Epoch 7, batch 3150, loss[loss=0.2743, simple_loss=0.3458, pruned_loss=0.1014, over 21293.00 frames. ], tot_loss[loss=0.2273, simple_loss=0.3046, pruned_loss=0.075, over 4288711.33 frames. ], batch size: 159, lr: 4.45e-03, grad_scale: 16.0 2023-06-24 14:13:35,313 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=1116702.0, ans=0.5 2023-06-24 14:13:35,867 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.49 vs. limit=15.0 2023-06-24 14:13:42,489 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1116702.0, ans=0.125 2023-06-24 14:13:59,175 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.73 vs. limit=22.5 2023-06-24 14:14:10,769 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=1116822.0, ans=0.2 2023-06-24 14:15:05,101 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1116942.0, ans=0.1 2023-06-24 14:15:22,162 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.040e+02 2.712e+02 3.098e+02 3.534e+02 5.991e+02, threshold=6.196e+02, percent-clipped=1.0 2023-06-24 14:15:22,192 INFO [train.py:996] (3/4) Epoch 7, batch 3200, loss[loss=0.2235, simple_loss=0.3028, pruned_loss=0.07209, over 20654.00 frames. ], tot_loss[loss=0.2294, simple_loss=0.3067, pruned_loss=0.07603, over 4286582.19 frames. ], batch size: 607, lr: 4.45e-03, grad_scale: 32.0 2023-06-24 14:15:50,688 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=1117062.0, ans=0.0 2023-06-24 14:15:55,891 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=10.85 vs. limit=15.0 2023-06-24 14:17:13,190 INFO [train.py:996] (3/4) Epoch 7, batch 3250, loss[loss=0.2497, simple_loss=0.2865, pruned_loss=0.1065, over 21405.00 frames. ], tot_loss[loss=0.2311, simple_loss=0.3069, pruned_loss=0.07762, over 4281576.37 frames. ], batch size: 510, lr: 4.45e-03, grad_scale: 16.0 2023-06-24 14:17:17,678 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=13.42 vs. 
limit=15.0 2023-06-24 14:17:18,901 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=1117302.0, ans=0.2 2023-06-24 14:17:22,225 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1117302.0, ans=0.125 2023-06-24 14:17:24,191 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1117302.0, ans=0.125 2023-06-24 14:18:26,107 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1117482.0, ans=0.125 2023-06-24 14:18:59,126 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1117542.0, ans=0.0 2023-06-24 14:19:05,821 INFO [train.py:996] (3/4) Epoch 7, batch 3300, loss[loss=0.1998, simple_loss=0.2945, pruned_loss=0.05254, over 19987.00 frames. ], tot_loss[loss=0.2284, simple_loss=0.3025, pruned_loss=0.07712, over 4276840.71 frames. ], batch size: 703, lr: 4.45e-03, grad_scale: 16.0 2023-06-24 14:19:06,527 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=1117602.0, ans=0.125 2023-06-24 14:19:07,511 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.143e+02 2.718e+02 3.384e+02 4.609e+02 8.476e+02, threshold=6.767e+02, percent-clipped=13.0 2023-06-24 14:19:08,130 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1117602.0, ans=0.1 2023-06-24 14:19:36,428 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1117662.0, ans=0.0 2023-06-24 14:20:56,360 INFO [train.py:996] (3/4) Epoch 7, batch 3350, loss[loss=0.2052, simple_loss=0.2843, pruned_loss=0.06309, over 21862.00 frames. ], tot_loss[loss=0.2318, simple_loss=0.3065, pruned_loss=0.07848, over 4279207.06 frames. ], batch size: 282, lr: 4.44e-03, grad_scale: 16.0 2023-06-24 14:21:09,971 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1117902.0, ans=0.0 2023-06-24 14:21:11,689 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1117902.0, ans=0.1 2023-06-24 14:21:13,546 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1117902.0, ans=0.0 2023-06-24 14:21:13,571 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1117902.0, ans=0.125 2023-06-24 14:21:17,571 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten.whitening_limit, batch_count=1117902.0, ans=15.0 2023-06-24 14:22:22,363 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.45 vs. 
limit=22.5 2023-06-24 14:22:27,708 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1118082.0, ans=0.1 2023-06-24 14:22:27,765 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=1118082.0, ans=0.125 2023-06-24 14:22:53,115 INFO [train.py:996] (3/4) Epoch 7, batch 3400, loss[loss=0.2291, simple_loss=0.3012, pruned_loss=0.07851, over 21321.00 frames. ], tot_loss[loss=0.233, simple_loss=0.3073, pruned_loss=0.0793, over 4279189.63 frames. ], batch size: 159, lr: 4.44e-03, grad_scale: 16.0 2023-06-24 14:22:54,389 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=8.18 vs. limit=15.0 2023-06-24 14:22:54,428 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.33 vs. limit=6.0 2023-06-24 14:22:54,754 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.113e+02 2.810e+02 3.179e+02 3.983e+02 5.568e+02, threshold=6.357e+02, percent-clipped=0.0 2023-06-24 14:22:55,259 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=1118202.0, ans=0.0 2023-06-24 14:23:09,011 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1118262.0, ans=0.125 2023-06-24 14:24:05,089 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.72 vs. limit=15.0 2023-06-24 14:24:30,311 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.79 vs. limit=15.0 2023-06-24 14:24:43,507 INFO [train.py:996] (3/4) Epoch 7, batch 3450, loss[loss=0.2868, simple_loss=0.3705, pruned_loss=0.1015, over 21734.00 frames. ], tot_loss[loss=0.2291, simple_loss=0.3022, pruned_loss=0.07805, over 4278921.09 frames. ], batch size: 351, lr: 4.44e-03, grad_scale: 16.0 2023-06-24 14:24:59,285 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1118502.0, ans=0.1 2023-06-24 14:25:18,983 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=1118562.0, ans=0.0 2023-06-24 14:25:48,260 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=1118622.0, ans=0.0 2023-06-24 14:25:57,597 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1118682.0, ans=0.125 2023-06-24 14:26:36,916 INFO [train.py:996] (3/4) Epoch 7, batch 3500, loss[loss=0.2627, simple_loss=0.3352, pruned_loss=0.09513, over 21481.00 frames. ], tot_loss[loss=0.2371, simple_loss=0.3115, pruned_loss=0.08138, over 4275420.62 frames. 
], batch size: 194, lr: 4.44e-03, grad_scale: 16.0 2023-06-24 14:26:38,657 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.230e+02 2.694e+02 2.966e+02 3.710e+02 5.580e+02, threshold=5.932e+02, percent-clipped=0.0 2023-06-24 14:26:59,703 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=1118802.0, ans=0.0 2023-06-24 14:27:11,558 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1118862.0, ans=0.125 2023-06-24 14:27:57,688 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1118982.0, ans=0.0 2023-06-24 14:28:07,397 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1118982.0, ans=0.125 2023-06-24 14:28:16,243 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.48 vs. limit=15.0 2023-06-24 14:28:33,478 INFO [train.py:996] (3/4) Epoch 7, batch 3550, loss[loss=0.2102, simple_loss=0.2707, pruned_loss=0.07486, over 21210.00 frames. ], tot_loss[loss=0.2403, simple_loss=0.3155, pruned_loss=0.08259, over 4275732.24 frames. ], batch size: 176, lr: 4.44e-03, grad_scale: 16.0 2023-06-24 14:28:49,829 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=1119102.0, ans=0.2 2023-06-24 14:29:16,312 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1119162.0, ans=0.125 2023-06-24 14:29:28,721 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1119222.0, ans=0.0 2023-06-24 14:29:54,776 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=1119282.0, ans=0.09899494936611666 2023-06-24 14:30:23,306 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1119402.0, ans=0.0 2023-06-24 14:30:24,484 INFO [train.py:996] (3/4) Epoch 7, batch 3600, loss[loss=0.1982, simple_loss=0.2588, pruned_loss=0.06879, over 21496.00 frames. ], tot_loss[loss=0.2352, simple_loss=0.3082, pruned_loss=0.08108, over 4279945.34 frames. ], batch size: 230, lr: 4.44e-03, grad_scale: 32.0 2023-06-24 14:30:31,399 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.346e+02 2.873e+02 3.282e+02 3.993e+02 6.971e+02, threshold=6.565e+02, percent-clipped=2.0 2023-06-24 14:30:35,722 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1119402.0, ans=0.125 2023-06-24 14:30:43,600 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=4.24 vs. limit=15.0 2023-06-24 14:31:40,545 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1119582.0, ans=0.0 2023-06-24 14:32:22,654 INFO [train.py:996] (3/4) Epoch 7, batch 3650, loss[loss=0.2454, simple_loss=0.3209, pruned_loss=0.085, over 21860.00 frames. ], tot_loss[loss=0.2361, simple_loss=0.3095, pruned_loss=0.08136, over 4282833.57 frames. 
], batch size: 118, lr: 4.44e-03, grad_scale: 32.0 2023-06-24 14:33:20,735 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=1119882.0, ans=0.125 2023-06-24 14:34:06,791 INFO [train.py:996] (3/4) Epoch 7, batch 3700, loss[loss=0.2133, simple_loss=0.2888, pruned_loss=0.06893, over 21495.00 frames. ], tot_loss[loss=0.2347, simple_loss=0.3088, pruned_loss=0.08028, over 4283659.80 frames. ], batch size: 212, lr: 4.44e-03, grad_scale: 32.0 2023-06-24 14:34:08,347 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.135e+02 2.822e+02 3.276e+02 3.785e+02 7.589e+02, threshold=6.551e+02, percent-clipped=1.0 2023-06-24 14:34:16,422 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=1120002.0, ans=0.125 2023-06-24 14:35:23,219 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=1120182.0, ans=0.125 2023-06-24 14:35:42,192 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.09 vs. limit=6.0 2023-06-24 14:36:02,198 INFO [train.py:996] (3/4) Epoch 7, batch 3750, loss[loss=0.2181, simple_loss=0.3021, pruned_loss=0.06707, over 21649.00 frames. ], tot_loss[loss=0.2324, simple_loss=0.3061, pruned_loss=0.07939, over 4288989.39 frames. ], batch size: 441, lr: 4.44e-03, grad_scale: 32.0 2023-06-24 14:36:04,540 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=1120302.0, ans=0.125 2023-06-24 14:37:08,676 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1120482.0, ans=0.0 2023-06-24 14:37:10,328 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1120482.0, ans=0.1 2023-06-24 14:37:57,859 INFO [train.py:996] (3/4) Epoch 7, batch 3800, loss[loss=0.2633, simple_loss=0.3517, pruned_loss=0.08746, over 21800.00 frames. ], tot_loss[loss=0.2287, simple_loss=0.3027, pruned_loss=0.07737, over 4285917.05 frames. ], batch size: 118, lr: 4.44e-03, grad_scale: 16.0 2023-06-24 14:38:01,825 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.853e+02 2.713e+02 3.064e+02 3.470e+02 5.470e+02, threshold=6.128e+02, percent-clipped=0.0 2023-06-24 14:38:10,712 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=1120602.0, ans=0.125 2023-06-24 14:38:17,912 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1120662.0, ans=0.0 2023-06-24 14:39:44,950 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2.whitening_limit, batch_count=1120842.0, ans=15.0 2023-06-24 14:39:49,659 INFO [train.py:996] (3/4) Epoch 7, batch 3850, loss[loss=0.2055, simple_loss=0.2651, pruned_loss=0.07294, over 21265.00 frames. ], tot_loss[loss=0.2277, simple_loss=0.3008, pruned_loss=0.0773, over 4285550.40 frames. 
], batch size: 159, lr: 4.44e-03, grad_scale: 16.0 2023-06-24 14:39:51,572 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=1120902.0, ans=0.015 2023-06-24 14:40:14,258 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.06 vs. limit=15.0 2023-06-24 14:40:36,973 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1121022.0, ans=0.125 2023-06-24 14:41:33,214 INFO [train.py:996] (3/4) Epoch 7, batch 3900, loss[loss=0.2121, simple_loss=0.2733, pruned_loss=0.07545, over 21588.00 frames. ], tot_loss[loss=0.2246, simple_loss=0.2961, pruned_loss=0.07653, over 4286643.27 frames. ], batch size: 263, lr: 4.44e-03, grad_scale: 16.0 2023-06-24 14:41:36,450 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.175e+02 2.710e+02 3.145e+02 3.584e+02 6.226e+02, threshold=6.291e+02, percent-clipped=1.0 2023-06-24 14:42:22,716 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1121322.0, ans=0.125 2023-06-24 14:42:56,405 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1121382.0, ans=0.125 2023-06-24 14:43:07,096 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1121442.0, ans=0.125 2023-06-24 14:43:25,168 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=1121442.0, ans=0.07 2023-06-24 14:43:28,692 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=1121442.0, ans=0.2 2023-06-24 14:43:31,393 INFO [train.py:996] (3/4) Epoch 7, batch 3950, loss[loss=0.1458, simple_loss=0.2227, pruned_loss=0.03442, over 21116.00 frames. ], tot_loss[loss=0.2253, simple_loss=0.2983, pruned_loss=0.07617, over 4283857.69 frames. ], batch size: 143, lr: 4.44e-03, grad_scale: 16.0 2023-06-24 14:43:43,503 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1121502.0, ans=0.125 2023-06-24 14:44:10,385 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1121562.0, ans=0.1 2023-06-24 14:45:00,747 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.98 vs. limit=22.5 2023-06-24 14:45:18,343 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=1121742.0, ans=0.05 2023-06-24 14:45:22,787 INFO [train.py:996] (3/4) Epoch 7, batch 4000, loss[loss=0.1666, simple_loss=0.2463, pruned_loss=0.04346, over 21758.00 frames. ], tot_loss[loss=0.22, simple_loss=0.293, pruned_loss=0.07346, over 4274650.82 frames. 
], batch size: 316, lr: 4.44e-03, grad_scale: 32.0 2023-06-24 14:45:26,593 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.070e+02 2.561e+02 2.888e+02 3.482e+02 6.063e+02, threshold=5.775e+02, percent-clipped=0.0 2023-06-24 14:46:38,052 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2.whitening_limit, batch_count=1121982.0, ans=15.0 2023-06-24 14:47:13,473 INFO [train.py:996] (3/4) Epoch 7, batch 4050, loss[loss=0.2115, simple_loss=0.3097, pruned_loss=0.0567, over 21738.00 frames. ], tot_loss[loss=0.2172, simple_loss=0.2916, pruned_loss=0.07139, over 4279525.47 frames. ], batch size: 282, lr: 4.44e-03, grad_scale: 32.0 2023-06-24 14:49:04,259 INFO [train.py:996] (3/4) Epoch 7, batch 4100, loss[loss=0.2407, simple_loss=0.3019, pruned_loss=0.08972, over 21553.00 frames. ], tot_loss[loss=0.218, simple_loss=0.2927, pruned_loss=0.07164, over 4277146.47 frames. ], batch size: 548, lr: 4.44e-03, grad_scale: 16.0 2023-06-24 14:49:08,894 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.185e+02 2.546e+02 2.998e+02 3.545e+02 8.551e+02, threshold=5.997e+02, percent-clipped=3.0 2023-06-24 14:49:23,503 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1122402.0, ans=0.125 2023-06-24 14:49:39,633 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=1122462.0, ans=0.2 2023-06-24 14:50:54,045 INFO [train.py:996] (3/4) Epoch 7, batch 4150, loss[loss=0.2657, simple_loss=0.3457, pruned_loss=0.09287, over 21549.00 frames. ], tot_loss[loss=0.2162, simple_loss=0.2934, pruned_loss=0.06947, over 4274747.83 frames. ], batch size: 508, lr: 4.44e-03, grad_scale: 16.0 2023-06-24 14:52:02,253 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-24 14:52:40,337 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1122942.0, ans=0.0 2023-06-24 14:52:46,856 INFO [train.py:996] (3/4) Epoch 7, batch 4200, loss[loss=0.1865, simple_loss=0.2769, pruned_loss=0.04798, over 21499.00 frames. ], tot_loss[loss=0.2171, simple_loss=0.2942, pruned_loss=0.07005, over 4268401.38 frames. ], batch size: 212, lr: 4.43e-03, grad_scale: 16.0 2023-06-24 14:52:57,885 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.766e+02 2.672e+02 2.976e+02 3.504e+02 5.360e+02, threshold=5.952e+02, percent-clipped=0.0 2023-06-24 14:53:39,171 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-24 14:54:04,155 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1123182.0, ans=0.1 2023-06-24 14:54:16,653 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1123182.0, ans=0.0 2023-06-24 14:54:23,741 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-24 14:54:42,179 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1123242.0, ans=0.125 2023-06-24 14:54:42,645 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.34 vs. 
limit=15.0 2023-06-24 14:54:45,208 INFO [train.py:996] (3/4) Epoch 7, batch 4250, loss[loss=0.2621, simple_loss=0.3363, pruned_loss=0.09393, over 20690.00 frames. ], tot_loss[loss=0.2243, simple_loss=0.3034, pruned_loss=0.07259, over 4269961.70 frames. ], batch size: 607, lr: 4.43e-03, grad_scale: 16.0 2023-06-24 14:55:18,014 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=1123362.0, ans=0.125 2023-06-24 14:55:38,354 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten.whitening_limit, batch_count=1123422.0, ans=22.5 2023-06-24 14:55:52,731 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=1123422.0, ans=10.0 2023-06-24 14:56:31,414 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=1123542.0, ans=0.07 2023-06-24 14:56:43,614 INFO [train.py:996] (3/4) Epoch 7, batch 4300, loss[loss=0.2277, simple_loss=0.3121, pruned_loss=0.07167, over 21751.00 frames. ], tot_loss[loss=0.2278, simple_loss=0.3073, pruned_loss=0.07415, over 4268823.25 frames. ], batch size: 118, lr: 4.43e-03, grad_scale: 16.0 2023-06-24 14:56:47,827 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1123602.0, ans=0.1 2023-06-24 14:56:48,731 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.245e+02 3.063e+02 3.693e+02 4.827e+02 7.345e+02, threshold=7.385e+02, percent-clipped=7.0 2023-06-24 14:57:48,022 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1123782.0, ans=0.125 2023-06-24 14:57:58,921 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=1123782.0, ans=0.2 2023-06-24 14:58:01,289 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.49 vs. limit=10.0 2023-06-24 14:58:39,601 INFO [train.py:996] (3/4) Epoch 7, batch 4350, loss[loss=0.2191, simple_loss=0.2904, pruned_loss=0.07387, over 21602.00 frames. ], tot_loss[loss=0.2278, simple_loss=0.3075, pruned_loss=0.07404, over 4262111.52 frames. ], batch size: 414, lr: 4.43e-03, grad_scale: 16.0 2023-06-24 14:59:10,194 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.78 vs. limit=6.0 2023-06-24 14:59:16,426 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1123962.0, ans=0.1 2023-06-24 15:00:35,510 INFO [train.py:996] (3/4) Epoch 7, batch 4400, loss[loss=0.2555, simple_loss=0.3224, pruned_loss=0.09429, over 19903.00 frames. ], tot_loss[loss=0.2256, simple_loss=0.3034, pruned_loss=0.07388, over 4251642.05 frames. 
], batch size: 702, lr: 4.43e-03, grad_scale: 32.0 2023-06-24 15:00:41,287 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.078e+02 2.891e+02 3.329e+02 4.006e+02 7.259e+02, threshold=6.659e+02, percent-clipped=0.0 2023-06-24 15:00:41,981 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1124202.0, ans=0.1 2023-06-24 15:00:43,973 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1124202.0, ans=0.125 2023-06-24 15:01:58,223 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=1124442.0, ans=0.125 2023-06-24 15:02:25,260 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=1124442.0, ans=0.0 2023-06-24 15:02:25,722 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.29 vs. limit=15.0 2023-06-24 15:02:28,239 INFO [train.py:996] (3/4) Epoch 7, batch 4450, loss[loss=0.2328, simple_loss=0.3156, pruned_loss=0.07502, over 21441.00 frames. ], tot_loss[loss=0.229, simple_loss=0.3083, pruned_loss=0.0748, over 4256866.01 frames. ], batch size: 211, lr: 4.43e-03, grad_scale: 32.0 2023-06-24 15:04:20,255 INFO [train.py:996] (3/4) Epoch 7, batch 4500, loss[loss=0.2203, simple_loss=0.3101, pruned_loss=0.0653, over 20760.00 frames. ], tot_loss[loss=0.2327, simple_loss=0.3111, pruned_loss=0.07714, over 4264133.27 frames. ], batch size: 607, lr: 4.43e-03, grad_scale: 32.0 2023-06-24 15:04:25,139 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.118e+02 2.933e+02 3.595e+02 4.328e+02 6.220e+02, threshold=7.189e+02, percent-clipped=0.0 2023-06-24 15:05:03,039 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=1124922.0, ans=0.125 2023-06-24 15:05:55,485 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=1125042.0, ans=0.0 2023-06-24 15:06:10,754 INFO [train.py:996] (3/4) Epoch 7, batch 4550, loss[loss=0.2855, simple_loss=0.3585, pruned_loss=0.1062, over 21581.00 frames. ], tot_loss[loss=0.2331, simple_loss=0.3125, pruned_loss=0.07679, over 4265714.83 frames. ], batch size: 414, lr: 4.43e-03, grad_scale: 32.0 2023-06-24 15:06:42,502 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1125162.0, ans=0.125 2023-06-24 15:07:01,792 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=1125222.0, ans=0.2 2023-06-24 15:07:41,999 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1125342.0, ans=0.1 2023-06-24 15:07:43,861 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1125342.0, ans=0.125 2023-06-24 15:07:52,880 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.25 vs. limit=15.0 2023-06-24 15:07:56,748 INFO [train.py:996] (3/4) Epoch 7, batch 4600, loss[loss=0.2099, simple_loss=0.2856, pruned_loss=0.06712, over 21426.00 frames. 
], tot_loss[loss=0.2346, simple_loss=0.3134, pruned_loss=0.07791, over 4273964.51 frames. ], batch size: 211, lr: 4.43e-03, grad_scale: 32.0 2023-06-24 15:07:57,742 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1125402.0, ans=0.125 2023-06-24 15:07:59,277 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1125402.0, ans=0.125 2023-06-24 15:08:02,269 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.231e+02 3.095e+02 3.765e+02 5.007e+02 9.113e+02, threshold=7.530e+02, percent-clipped=6.0 2023-06-24 15:08:50,782 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1125522.0, ans=0.0 2023-06-24 15:08:51,411 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=7.89 vs. limit=15.0 2023-06-24 15:09:01,837 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.41 vs. limit=6.0 2023-06-24 15:09:40,154 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.55 vs. limit=22.5 2023-06-24 15:09:45,815 INFO [train.py:996] (3/4) Epoch 7, batch 4650, loss[loss=0.1732, simple_loss=0.2499, pruned_loss=0.04824, over 21609.00 frames. ], tot_loss[loss=0.2294, simple_loss=0.308, pruned_loss=0.07545, over 4278362.04 frames. ], batch size: 263, lr: 4.43e-03, grad_scale: 32.0 2023-06-24 15:09:50,338 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.93 vs. limit=15.0 2023-06-24 15:11:13,237 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1125882.0, ans=0.125 2023-06-24 15:11:35,512 INFO [train.py:996] (3/4) Epoch 7, batch 4700, loss[loss=0.2146, simple_loss=0.2756, pruned_loss=0.07684, over 21279.00 frames. ], tot_loss[loss=0.2219, simple_loss=0.2977, pruned_loss=0.07304, over 4278490.34 frames. 
], batch size: 159, lr: 4.43e-03, grad_scale: 32.0 2023-06-24 15:11:36,082 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1126002.0, ans=0.0 2023-06-24 15:11:45,716 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.908e+02 2.572e+02 2.876e+02 3.232e+02 6.204e+02, threshold=5.752e+02, percent-clipped=0.0 2023-06-24 15:11:59,820 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=1126062.0, ans=0.04949747468305833 2023-06-24 15:12:40,256 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1126122.0, ans=0.1 2023-06-24 15:12:45,316 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-24 15:13:05,454 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1126242.0, ans=0.1 2023-06-24 15:13:05,487 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=1126242.0, ans=0.07 2023-06-24 15:13:05,560 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=1126242.0, ans=0.05 2023-06-24 15:13:17,049 INFO [train.py:996] (3/4) Epoch 7, batch 4750, loss[loss=0.2083, simple_loss=0.2786, pruned_loss=0.06898, over 21697.00 frames. ], tot_loss[loss=0.2201, simple_loss=0.2934, pruned_loss=0.07342, over 4285674.12 frames. ], batch size: 298, lr: 4.43e-03, grad_scale: 32.0 2023-06-24 15:13:34,494 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1126302.0, ans=0.125 2023-06-24 15:13:35,848 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-24 15:15:13,765 INFO [train.py:996] (3/4) Epoch 7, batch 4800, loss[loss=0.2667, simple_loss=0.3196, pruned_loss=0.1069, over 21547.00 frames. ], tot_loss[loss=0.2227, simple_loss=0.2963, pruned_loss=0.07449, over 4289782.56 frames. ], batch size: 473, lr: 4.43e-03, grad_scale: 32.0 2023-06-24 15:15:19,153 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.126e+02 2.780e+02 3.342e+02 3.933e+02 6.055e+02, threshold=6.684e+02, percent-clipped=1.0 2023-06-24 15:15:39,632 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1126662.0, ans=0.125 2023-06-24 15:16:05,845 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=1126722.0, ans=0.125 2023-06-24 15:16:05,906 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1126722.0, ans=0.0 2023-06-24 15:16:17,551 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.16 vs. 
limit=12.0 2023-06-24 15:16:23,782 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=1126782.0, ans=0.125 2023-06-24 15:16:31,165 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1126782.0, ans=0.0 2023-06-24 15:16:39,940 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-24 15:16:45,083 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1126842.0, ans=0.125 2023-06-24 15:16:59,126 INFO [train.py:996] (3/4) Epoch 7, batch 4850, loss[loss=0.2187, simple_loss=0.3288, pruned_loss=0.05424, over 20947.00 frames. ], tot_loss[loss=0.2217, simple_loss=0.296, pruned_loss=0.0737, over 4288505.88 frames. ], batch size: 608, lr: 4.43e-03, grad_scale: 32.0 2023-06-24 15:18:32,841 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=1127142.0, ans=0.0 2023-06-24 15:18:50,501 INFO [train.py:996] (3/4) Epoch 7, batch 4900, loss[loss=0.2437, simple_loss=0.3407, pruned_loss=0.07337, over 21722.00 frames. ], tot_loss[loss=0.2231, simple_loss=0.2985, pruned_loss=0.0739, over 4271429.32 frames. ], batch size: 298, lr: 4.43e-03, grad_scale: 32.0 2023-06-24 15:18:55,340 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.943e+02 2.644e+02 3.017e+02 3.473e+02 6.026e+02, threshold=6.033e+02, percent-clipped=0.0 2023-06-24 15:19:01,878 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=1127202.0, ans=0.0 2023-06-24 15:20:24,460 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=1127442.0, ans=0.125 2023-06-24 15:20:32,936 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-24 15:20:41,534 INFO [train.py:996] (3/4) Epoch 7, batch 4950, loss[loss=0.2638, simple_loss=0.3585, pruned_loss=0.08454, over 21648.00 frames. ], tot_loss[loss=0.2251, simple_loss=0.3023, pruned_loss=0.07391, over 4268649.89 frames. ], batch size: 441, lr: 4.43e-03, grad_scale: 32.0 2023-06-24 15:21:23,029 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.40 vs. limit=15.0 2023-06-24 15:21:33,344 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1127622.0, ans=0.1 2023-06-24 15:22:30,505 INFO [train.py:996] (3/4) Epoch 7, batch 5000, loss[loss=0.2491, simple_loss=0.321, pruned_loss=0.08863, over 21769.00 frames. ], tot_loss[loss=0.2218, simple_loss=0.3013, pruned_loss=0.07116, over 4272639.59 frames. ], batch size: 112, lr: 4.43e-03, grad_scale: 32.0 2023-06-24 15:22:35,396 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.800e+02 2.510e+02 2.912e+02 3.367e+02 5.959e+02, threshold=5.824e+02, percent-clipped=0.0 2023-06-24 15:22:37,803 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=1127802.0, ans=0.125 2023-06-24 15:23:47,525 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.66 vs. 
limit=6.0 2023-06-24 15:23:56,341 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten.whitening_limit, batch_count=1128042.0, ans=15.0 2023-06-24 15:24:19,986 INFO [train.py:996] (3/4) Epoch 7, batch 5050, loss[loss=0.2168, simple_loss=0.2997, pruned_loss=0.06692, over 19976.00 frames. ], tot_loss[loss=0.2226, simple_loss=0.3004, pruned_loss=0.07244, over 4276638.45 frames. ], batch size: 702, lr: 4.42e-03, grad_scale: 32.0 2023-06-24 15:26:10,466 INFO [train.py:996] (3/4) Epoch 7, batch 5100, loss[loss=0.1893, simple_loss=0.263, pruned_loss=0.05778, over 21825.00 frames. ], tot_loss[loss=0.2213, simple_loss=0.2987, pruned_loss=0.07195, over 4282970.66 frames. ], batch size: 124, lr: 4.42e-03, grad_scale: 16.0 2023-06-24 15:26:17,208 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.025e+02 2.691e+02 3.129e+02 3.589e+02 6.328e+02, threshold=6.257e+02, percent-clipped=2.0 2023-06-24 15:26:32,137 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.67 vs. limit=15.0 2023-06-24 15:26:57,223 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.45 vs. limit=15.0 2023-06-24 15:28:00,593 INFO [train.py:996] (3/4) Epoch 7, batch 5150, loss[loss=0.2342, simple_loss=0.3131, pruned_loss=0.07765, over 21819.00 frames. ], tot_loss[loss=0.2217, simple_loss=0.2976, pruned_loss=0.07293, over 4287965.69 frames. ], batch size: 332, lr: 4.42e-03, grad_scale: 16.0 2023-06-24 15:28:01,411 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-24 15:28:10,646 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1128702.0, ans=0.0 2023-06-24 15:28:33,777 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=11.39 vs. limit=22.5 2023-06-24 15:29:04,520 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1128822.0, ans=0.125 2023-06-24 15:29:52,274 INFO [train.py:996] (3/4) Epoch 7, batch 5200, loss[loss=0.2244, simple_loss=0.3118, pruned_loss=0.06845, over 21303.00 frames. ], tot_loss[loss=0.2244, simple_loss=0.302, pruned_loss=0.07339, over 4279042.41 frames. 
], batch size: 176, lr: 4.42e-03, grad_scale: 32.0 2023-06-24 15:29:59,474 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.122e+02 2.769e+02 3.246e+02 4.133e+02 8.749e+02, threshold=6.492e+02, percent-clipped=7.0 2023-06-24 15:30:40,908 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=1129062.0, ans=0.125 2023-06-24 15:30:49,436 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1129122.0, ans=0.0 2023-06-24 15:30:51,603 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1129122.0, ans=0.125 2023-06-24 15:30:55,045 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=1129122.0, ans=0.2 2023-06-24 15:31:34,729 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1129242.0, ans=0.125 2023-06-24 15:31:36,976 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=11.62 vs. limit=22.5 2023-06-24 15:31:41,108 INFO [train.py:996] (3/4) Epoch 7, batch 5250, loss[loss=0.2028, simple_loss=0.2966, pruned_loss=0.05452, over 21649.00 frames. ], tot_loss[loss=0.225, simple_loss=0.3054, pruned_loss=0.07233, over 4273795.30 frames. ], batch size: 263, lr: 4.42e-03, grad_scale: 32.0 2023-06-24 15:32:35,090 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=1129422.0, ans=0.0 2023-06-24 15:32:42,199 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1129422.0, ans=0.1 2023-06-24 15:32:47,792 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1129422.0, ans=0.125 2023-06-24 15:33:08,171 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=2.98 vs. limit=12.0 2023-06-24 15:33:24,845 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=1129542.0, ans=0.125 2023-06-24 15:33:24,962 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=1129542.0, ans=0.0 2023-06-24 15:33:28,858 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer_na.min_abs, batch_count=1129542.0, ans=0.02 2023-06-24 15:33:31,870 INFO [train.py:996] (3/4) Epoch 7, batch 5300, loss[loss=0.2214, simple_loss=0.2856, pruned_loss=0.0786, over 21587.00 frames. ], tot_loss[loss=0.2253, simple_loss=0.3043, pruned_loss=0.07312, over 4284303.30 frames. 
], batch size: 548, lr: 4.42e-03, grad_scale: 32.0 2023-06-24 15:33:38,414 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.003e+02 2.522e+02 2.825e+02 3.420e+02 5.349e+02, threshold=5.650e+02, percent-clipped=0.0 2023-06-24 15:33:40,608 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1129602.0, ans=0.125 2023-06-24 15:34:00,143 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=1129662.0, ans=0.05 2023-06-24 15:34:28,776 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1129722.0, ans=0.0 2023-06-24 15:34:36,851 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1129782.0, ans=0.125 2023-06-24 15:35:09,802 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1129842.0, ans=0.125 2023-06-24 15:35:17,632 INFO [train.py:996] (3/4) Epoch 7, batch 5350, loss[loss=0.2284, simple_loss=0.2924, pruned_loss=0.08217, over 21757.00 frames. ], tot_loss[loss=0.2262, simple_loss=0.3028, pruned_loss=0.07477, over 4287913.16 frames. ], batch size: 441, lr: 4.42e-03, grad_scale: 16.0 2023-06-24 15:37:07,118 INFO [train.py:996] (3/4) Epoch 7, batch 5400, loss[loss=0.252, simple_loss=0.316, pruned_loss=0.09396, over 21689.00 frames. ], tot_loss[loss=0.2258, simple_loss=0.3012, pruned_loss=0.0752, over 4292965.73 frames. ], batch size: 508, lr: 4.42e-03, grad_scale: 16.0 2023-06-24 15:37:16,428 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.105e+02 2.748e+02 3.020e+02 3.535e+02 6.573e+02, threshold=6.041e+02, percent-clipped=2.0 2023-06-24 15:37:58,063 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=1130322.0, ans=0.5 2023-06-24 15:38:59,040 INFO [train.py:996] (3/4) Epoch 7, batch 5450, loss[loss=0.2503, simple_loss=0.3577, pruned_loss=0.07143, over 21736.00 frames. ], tot_loss[loss=0.2238, simple_loss=0.3013, pruned_loss=0.07312, over 4291799.54 frames. ], batch size: 332, lr: 4.42e-03, grad_scale: 16.0 2023-06-24 15:39:05,487 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=10.14 vs. limit=22.5 2023-06-24 15:39:05,545 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=8.05 vs. limit=15.0 2023-06-24 15:39:27,065 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=1130562.0, ans=0.0 2023-06-24 15:39:40,163 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=1130622.0, ans=0.2 2023-06-24 15:39:58,051 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1130622.0, ans=0.125 2023-06-24 15:40:46,849 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1130742.0, ans=0.125 2023-06-24 15:40:50,125 INFO [train.py:996] (3/4) Epoch 7, batch 5500, loss[loss=0.316, simple_loss=0.3908, pruned_loss=0.1205, over 21468.00 frames. 
], tot_loss[loss=0.2229, simple_loss=0.3049, pruned_loss=0.07046, over 4288946.46 frames. ], batch size: 507, lr: 4.42e-03, grad_scale: 16.0 2023-06-24 15:40:58,241 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.932e+02 2.852e+02 3.783e+02 5.353e+02 8.274e+02, threshold=7.565e+02, percent-clipped=13.0 2023-06-24 15:42:40,372 INFO [train.py:996] (3/4) Epoch 7, batch 5550, loss[loss=0.1896, simple_loss=0.2823, pruned_loss=0.04844, over 21701.00 frames. ], tot_loss[loss=0.2208, simple_loss=0.3052, pruned_loss=0.06821, over 4280923.22 frames. ], batch size: 247, lr: 4.42e-03, grad_scale: 16.0 2023-06-24 15:42:42,716 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=1131102.0, ans=0.2 2023-06-24 15:43:47,990 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-24 15:43:48,573 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.55 vs. limit=15.0 2023-06-24 15:44:31,788 INFO [train.py:996] (3/4) Epoch 7, batch 5600, loss[loss=0.2337, simple_loss=0.336, pruned_loss=0.06574, over 21878.00 frames. ], tot_loss[loss=0.2208, simple_loss=0.3058, pruned_loss=0.0679, over 4280775.80 frames. ], batch size: 317, lr: 4.42e-03, grad_scale: 32.0 2023-06-24 15:44:45,772 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.709e+02 2.480e+02 2.959e+02 3.871e+02 8.894e+02, threshold=5.918e+02, percent-clipped=1.0 2023-06-24 15:45:10,593 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=1131522.0, ans=0.04949747468305833 2023-06-24 15:45:12,387 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1131522.0, ans=0.1 2023-06-24 15:45:58,250 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=8.50 vs. limit=15.0 2023-06-24 15:46:19,841 INFO [train.py:996] (3/4) Epoch 7, batch 5650, loss[loss=0.2086, simple_loss=0.2874, pruned_loss=0.06489, over 21893.00 frames. ], tot_loss[loss=0.2234, simple_loss=0.3078, pruned_loss=0.06954, over 4288492.96 frames. ], batch size: 316, lr: 4.42e-03, grad_scale: 32.0 2023-06-24 15:46:34,606 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1131702.0, ans=0.0 2023-06-24 15:46:34,721 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1131702.0, ans=0.0 2023-06-24 15:46:45,847 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1131762.0, ans=0.125 2023-06-24 15:46:48,079 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=10.95 vs. 
limit=15.0 2023-06-24 15:46:49,013 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=1131762.0, ans=0.125 2023-06-24 15:47:08,265 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1131822.0, ans=0.125 2023-06-24 15:48:05,245 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1131942.0, ans=0.0 2023-06-24 15:48:15,335 INFO [train.py:996] (3/4) Epoch 7, batch 5700, loss[loss=0.2416, simple_loss=0.3243, pruned_loss=0.07941, over 21639.00 frames. ], tot_loss[loss=0.224, simple_loss=0.3066, pruned_loss=0.07066, over 4292035.33 frames. ], batch size: 441, lr: 4.42e-03, grad_scale: 16.0 2023-06-24 15:48:26,252 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.045e+02 2.629e+02 3.066e+02 3.731e+02 7.827e+02, threshold=6.133e+02, percent-clipped=4.0 2023-06-24 15:49:00,302 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=1132122.0, ans=0.04949747468305833 2023-06-24 15:49:11,223 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1132122.0, ans=0.0 2023-06-24 15:49:21,582 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1132122.0, ans=0.125 2023-06-24 15:49:23,220 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=1132182.0, ans=0.2 2023-06-24 15:49:23,726 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=7.36 vs. limit=15.0 2023-06-24 15:49:59,967 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1132242.0, ans=0.125 2023-06-24 15:50:06,421 INFO [train.py:996] (3/4) Epoch 7, batch 5750, loss[loss=0.1843, simple_loss=0.2839, pruned_loss=0.0423, over 21748.00 frames. ], tot_loss[loss=0.2189, simple_loss=0.302, pruned_loss=0.06793, over 4281993.02 frames. ], batch size: 332, lr: 4.42e-03, grad_scale: 16.0 2023-06-24 15:50:21,850 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=3.85 vs. limit=15.0 2023-06-24 15:50:53,421 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1132362.0, ans=0.0 2023-06-24 15:51:18,699 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1132482.0, ans=0.0 2023-06-24 15:51:43,107 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=1132542.0, ans=0.125 2023-06-24 15:51:56,251 INFO [train.py:996] (3/4) Epoch 7, batch 5800, loss[loss=0.2266, simple_loss=0.3216, pruned_loss=0.06575, over 21784.00 frames. ], tot_loss[loss=0.2169, simple_loss=0.3008, pruned_loss=0.06649, over 4273662.58 frames. 
], batch size: 282, lr: 4.42e-03, grad_scale: 16.0 2023-06-24 15:52:12,047 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.870e+02 2.681e+02 3.323e+02 4.302e+02 6.884e+02, threshold=6.646e+02, percent-clipped=1.0 2023-06-24 15:52:50,973 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.93 vs. limit=15.0 2023-06-24 15:53:58,476 INFO [train.py:996] (3/4) Epoch 7, batch 5850, loss[loss=0.1657, simple_loss=0.2737, pruned_loss=0.02888, over 21757.00 frames. ], tot_loss[loss=0.2118, simple_loss=0.2985, pruned_loss=0.06251, over 4281716.67 frames. ], batch size: 332, lr: 4.42e-03, grad_scale: 16.0 2023-06-24 15:54:28,811 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1132962.0, ans=0.125 2023-06-24 15:55:10,669 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1133082.0, ans=0.1 2023-06-24 15:55:51,561 INFO [train.py:996] (3/4) Epoch 7, batch 5900, loss[loss=0.1715, simple_loss=0.2497, pruned_loss=0.04663, over 21584.00 frames. ], tot_loss[loss=0.2034, simple_loss=0.2911, pruned_loss=0.05781, over 4286255.99 frames. ], batch size: 211, lr: 4.41e-03, grad_scale: 16.0 2023-06-24 15:56:01,765 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.413e+02 2.024e+02 2.372e+02 2.933e+02 6.586e+02, threshold=4.744e+02, percent-clipped=0.0 2023-06-24 15:56:13,686 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.48 vs. limit=15.0 2023-06-24 15:56:16,532 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=1133262.0, ans=0.2 2023-06-24 15:56:21,775 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1133262.0, ans=0.1 2023-06-24 15:56:25,443 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=16.26 vs. limit=15.0 2023-06-24 15:56:53,833 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=1133382.0, ans=0.125 2023-06-24 15:56:58,866 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1133382.0, ans=0.1 2023-06-24 15:57:18,533 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=16.52 vs. limit=22.5 2023-06-24 15:57:39,661 INFO [train.py:996] (3/4) Epoch 7, batch 5950, loss[loss=0.1887, simple_loss=0.2673, pruned_loss=0.055, over 21806.00 frames. ], tot_loss[loss=0.2053, simple_loss=0.2899, pruned_loss=0.06035, over 4290755.28 frames. ], batch size: 247, lr: 4.41e-03, grad_scale: 16.0 2023-06-24 15:57:42,699 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.14 vs. limit=10.0 2023-06-24 15:58:01,435 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.77 vs. 
limit=15.0 2023-06-24 15:58:47,964 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1133682.0, ans=0.125 2023-06-24 15:59:03,191 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1133742.0, ans=0.125 2023-06-24 15:59:27,087 INFO [train.py:996] (3/4) Epoch 7, batch 6000, loss[loss=0.2189, simple_loss=0.2792, pruned_loss=0.07932, over 21786.00 frames. ], tot_loss[loss=0.2075, simple_loss=0.2877, pruned_loss=0.06365, over 4262710.23 frames. ], batch size: 112, lr: 4.41e-03, grad_scale: 32.0 2023-06-24 15:59:27,088 INFO [train.py:1019] (3/4) Computing validation loss 2023-06-24 15:59:44,453 INFO [train.py:1028] (3/4) Epoch 7, validation: loss=0.2613, simple_loss=0.3539, pruned_loss=0.08436, over 1796401.00 frames. 2023-06-24 15:59:44,454 INFO [train.py:1029] (3/4) Maximum memory allocated so far is 23409MB 2023-06-24 15:59:57,238 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.008e+02 3.144e+02 3.731e+02 4.665e+02 6.977e+02, threshold=7.462e+02, percent-clipped=24.0 2023-06-24 16:00:23,314 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=1133862.0, ans=0.04949747468305833 2023-06-24 16:01:03,754 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=1133982.0, ans=0.125 2023-06-24 16:01:05,437 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1133982.0, ans=0.125 2023-06-24 16:01:36,698 INFO [train.py:996] (3/4) Epoch 7, batch 6050, loss[loss=0.194, simple_loss=0.2656, pruned_loss=0.06122, over 21503.00 frames. ], tot_loss[loss=0.2071, simple_loss=0.2859, pruned_loss=0.06413, over 4262179.15 frames. ], batch size: 441, lr: 4.41e-03, grad_scale: 16.0 2023-06-24 16:01:58,084 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1134162.0, ans=0.125 2023-06-24 16:03:27,728 INFO [train.py:996] (3/4) Epoch 7, batch 6100, loss[loss=0.2276, simple_loss=0.3026, pruned_loss=0.07631, over 21800.00 frames. ], tot_loss[loss=0.2054, simple_loss=0.2846, pruned_loss=0.06312, over 4270101.12 frames. ], batch size: 112, lr: 4.41e-03, grad_scale: 16.0 2023-06-24 16:03:37,514 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.24 vs. 
limit=10.0 2023-06-24 16:03:39,887 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.866e+02 2.425e+02 2.947e+02 3.693e+02 6.413e+02, threshold=5.895e+02, percent-clipped=0.0 2023-06-24 16:03:55,696 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys.whitening_limit, batch_count=1134462.0, ans=6.0 2023-06-24 16:04:00,437 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=1134462.0, ans=0.125 2023-06-24 16:04:12,663 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=1134522.0, ans=0.125 2023-06-24 16:04:29,451 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1134582.0, ans=0.125 2023-06-24 16:05:03,727 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1134642.0, ans=0.125 2023-06-24 16:05:17,145 INFO [train.py:996] (3/4) Epoch 7, batch 6150, loss[loss=0.1955, simple_loss=0.272, pruned_loss=0.05953, over 21594.00 frames. ], tot_loss[loss=0.2083, simple_loss=0.2856, pruned_loss=0.06548, over 4271514.03 frames. ], batch size: 230, lr: 4.41e-03, grad_scale: 8.0 2023-06-24 16:06:14,912 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=1134822.0, ans=0.125 2023-06-24 16:07:05,683 INFO [train.py:996] (3/4) Epoch 7, batch 6200, loss[loss=0.2212, simple_loss=0.3018, pruned_loss=0.07031, over 21858.00 frames. ], tot_loss[loss=0.2095, simple_loss=0.2874, pruned_loss=0.06582, over 4270200.86 frames. ], batch size: 118, lr: 4.41e-03, grad_scale: 8.0 2023-06-24 16:07:25,602 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.918e+02 2.569e+02 3.119e+02 3.567e+02 5.212e+02, threshold=6.237e+02, percent-clipped=0.0 2023-06-24 16:07:35,354 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1135062.0, ans=0.1 2023-06-24 16:07:42,745 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.25 vs. limit=15.0 2023-06-24 16:07:49,660 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1135122.0, ans=0.125 2023-06-24 16:08:55,360 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1135302.0, ans=0.125 2023-06-24 16:08:56,817 INFO [train.py:996] (3/4) Epoch 7, batch 6250, loss[loss=0.1982, simple_loss=0.2875, pruned_loss=0.05448, over 21792.00 frames. ], tot_loss[loss=0.2113, simple_loss=0.2912, pruned_loss=0.06566, over 4271699.71 frames. ], batch size: 282, lr: 4.41e-03, grad_scale: 8.0 2023-06-24 16:09:24,138 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.55 vs. 
limit=6.0 2023-06-24 16:09:25,166 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=1135362.0, ans=0.95 2023-06-24 16:09:27,155 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-24 16:09:59,515 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1135422.0, ans=0.125 2023-06-24 16:10:31,409 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=1135542.0, ans=0.0 2023-06-24 16:10:51,869 INFO [train.py:996] (3/4) Epoch 7, batch 6300, loss[loss=0.2339, simple_loss=0.3008, pruned_loss=0.08349, over 21330.00 frames. ], tot_loss[loss=0.2129, simple_loss=0.2945, pruned_loss=0.06568, over 4272572.74 frames. ], batch size: 159, lr: 4.41e-03, grad_scale: 8.0 2023-06-24 16:10:56,070 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=1135602.0, ans=0.0 2023-06-24 16:10:59,631 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1135602.0, ans=0.125 2023-06-24 16:11:04,863 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1135602.0, ans=0.125 2023-06-24 16:11:06,076 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.925e+02 2.617e+02 3.122e+02 4.088e+02 6.551e+02, threshold=6.244e+02, percent-clipped=1.0 2023-06-24 16:11:13,120 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1135662.0, ans=0.1 2023-06-24 16:11:48,574 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=1135722.0, ans=0.0 2023-06-24 16:12:20,139 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=1135842.0, ans=0.125 2023-06-24 16:12:40,739 INFO [train.py:996] (3/4) Epoch 7, batch 6350, loss[loss=0.2263, simple_loss=0.2927, pruned_loss=0.07995, over 21440.00 frames. ], tot_loss[loss=0.2195, simple_loss=0.2981, pruned_loss=0.07051, over 4279842.78 frames. ], batch size: 211, lr: 4.41e-03, grad_scale: 8.0 2023-06-24 16:12:49,406 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=7.36 vs. limit=10.0 2023-06-24 16:13:16,955 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=7.07 vs. limit=15.0 2023-06-24 16:13:48,015 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1136082.0, ans=0.1 2023-06-24 16:14:08,418 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=1136082.0, ans=0.07 2023-06-24 16:14:30,409 INFO [train.py:996] (3/4) Epoch 7, batch 6400, loss[loss=0.2356, simple_loss=0.3127, pruned_loss=0.07928, over 21872.00 frames. ], tot_loss[loss=0.2258, simple_loss=0.3033, pruned_loss=0.07412, over 4281719.55 frames. 
], batch size: 124, lr: 4.41e-03, grad_scale: 16.0 2023-06-24 16:14:46,401 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=13.11 vs. limit=22.5 2023-06-24 16:14:55,408 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.098e+02 2.966e+02 3.361e+02 3.840e+02 6.220e+02, threshold=6.721e+02, percent-clipped=0.0 2023-06-24 16:16:25,896 INFO [train.py:996] (3/4) Epoch 7, batch 6450, loss[loss=0.1831, simple_loss=0.2712, pruned_loss=0.0475, over 16290.00 frames. ], tot_loss[loss=0.2273, simple_loss=0.3065, pruned_loss=0.07406, over 4279576.70 frames. ], batch size: 63, lr: 4.41e-03, grad_scale: 16.0 2023-06-24 16:16:55,738 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=1136562.0, ans=0.09899494936611666 2023-06-24 16:16:59,744 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.94 vs. limit=6.0 2023-06-24 16:17:34,814 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=1136682.0, ans=0.2 2023-06-24 16:18:14,952 INFO [train.py:996] (3/4) Epoch 7, batch 6500, loss[loss=0.1833, simple_loss=0.2639, pruned_loss=0.05138, over 21587.00 frames. ], tot_loss[loss=0.2225, simple_loss=0.2999, pruned_loss=0.07254, over 4279630.59 frames. ], batch size: 230, lr: 4.41e-03, grad_scale: 16.0 2023-06-24 16:18:24,258 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.47 vs. limit=15.0 2023-06-24 16:18:26,863 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=1136802.0, ans=0.2 2023-06-24 16:18:35,580 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1136802.0, ans=0.1 2023-06-24 16:18:38,259 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.984e+02 2.852e+02 3.600e+02 4.849e+02 8.797e+02, threshold=7.199e+02, percent-clipped=3.0 2023-06-24 16:19:10,665 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1136922.0, ans=0.125 2023-06-24 16:19:14,613 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.67 vs. limit=15.0 2023-06-24 16:19:58,991 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=1137042.0, ans=0.2 2023-06-24 16:20:03,505 INFO [train.py:996] (3/4) Epoch 7, batch 6550, loss[loss=0.2572, simple_loss=0.3295, pruned_loss=0.09244, over 21865.00 frames. ], tot_loss[loss=0.2211, simple_loss=0.2997, pruned_loss=0.0712, over 4279118.92 frames. 
], batch size: 107, lr: 4.41e-03, grad_scale: 16.0 2023-06-24 16:20:12,920 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=1137102.0, ans=0.09899494936611666 2023-06-24 16:20:47,375 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1137222.0, ans=0.125 2023-06-24 16:21:24,997 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1137282.0, ans=0.125 2023-06-24 16:21:53,184 INFO [train.py:996] (3/4) Epoch 7, batch 6600, loss[loss=0.1887, simple_loss=0.2508, pruned_loss=0.06325, over 21594.00 frames. ], tot_loss[loss=0.2168, simple_loss=0.293, pruned_loss=0.07034, over 4276424.76 frames. ], batch size: 247, lr: 4.41e-03, grad_scale: 16.0 2023-06-24 16:22:17,167 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.905e+02 2.518e+02 2.917e+02 3.263e+02 5.305e+02, threshold=5.833e+02, percent-clipped=0.0 2023-06-24 16:23:03,655 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1137582.0, ans=0.1 2023-06-24 16:23:13,031 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1137582.0, ans=0.0 2023-06-24 16:23:53,021 INFO [train.py:996] (3/4) Epoch 7, batch 6650, loss[loss=0.2138, simple_loss=0.282, pruned_loss=0.07282, over 21969.00 frames. ], tot_loss[loss=0.2111, simple_loss=0.2856, pruned_loss=0.06832, over 4274496.35 frames. ], batch size: 103, lr: 4.41e-03, grad_scale: 16.0 2023-06-24 16:24:00,625 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=1137702.0, ans=0.125 2023-06-24 16:25:19,639 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-24 16:25:39,499 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.62 vs. limit=15.0 2023-06-24 16:25:43,956 INFO [train.py:996] (3/4) Epoch 7, batch 6700, loss[loss=0.204, simple_loss=0.2487, pruned_loss=0.07969, over 20414.00 frames. ], tot_loss[loss=0.2081, simple_loss=0.281, pruned_loss=0.06761, over 4266057.92 frames. ], batch size: 703, lr: 4.41e-03, grad_scale: 16.0 2023-06-24 16:25:57,323 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.898e+02 2.457e+02 2.786e+02 3.230e+02 4.297e+02, threshold=5.572e+02, percent-clipped=0.0 2023-06-24 16:26:52,165 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-24 16:27:04,956 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1138242.0, ans=0.125 2023-06-24 16:27:26,644 INFO [train.py:996] (3/4) Epoch 7, batch 6750, loss[loss=0.23, simple_loss=0.2918, pruned_loss=0.08413, over 21812.00 frames. ], tot_loss[loss=0.207, simple_loss=0.2788, pruned_loss=0.06757, over 4270127.54 frames. 
], batch size: 282, lr: 4.40e-03, grad_scale: 16.0 2023-06-24 16:27:32,465 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1138302.0, ans=0.125 2023-06-24 16:27:36,320 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=1138302.0, ans=0.125 2023-06-24 16:27:44,668 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=1138362.0, ans=0.1 2023-06-24 16:28:01,709 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1138422.0, ans=0.125 2023-06-24 16:28:19,102 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=1138422.0, ans=0.125 2023-06-24 16:28:27,129 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.03 vs. limit=15.0 2023-06-24 16:29:09,298 INFO [train.py:996] (3/4) Epoch 7, batch 6800, loss[loss=0.1925, simple_loss=0.2617, pruned_loss=0.06162, over 21624.00 frames. ], tot_loss[loss=0.2107, simple_loss=0.2816, pruned_loss=0.06994, over 4265941.73 frames. ], batch size: 263, lr: 4.40e-03, grad_scale: 32.0 2023-06-24 16:29:23,223 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.205e+02 2.710e+02 3.194e+02 3.747e+02 5.784e+02, threshold=6.389e+02, percent-clipped=2.0 2023-06-24 16:29:29,933 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.14 vs. limit=15.0 2023-06-24 16:29:36,390 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=1138662.0, ans=0.2 2023-06-24 16:29:50,601 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1138722.0, ans=0.1 2023-06-24 16:30:35,873 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.78 vs. limit=15.0 2023-06-24 16:30:51,555 INFO [train.py:996] (3/4) Epoch 7, batch 6850, loss[loss=0.2027, simple_loss=0.269, pruned_loss=0.0682, over 21669.00 frames. ], tot_loss[loss=0.2107, simple_loss=0.2803, pruned_loss=0.07056, over 4271283.28 frames. ], batch size: 264, lr: 4.40e-03, grad_scale: 16.0 2023-06-24 16:31:57,334 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1139082.0, ans=0.125 2023-06-24 16:32:07,242 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.71 vs. limit=22.5 2023-06-24 16:32:41,622 INFO [train.py:996] (3/4) Epoch 7, batch 6900, loss[loss=0.2452, simple_loss=0.305, pruned_loss=0.09269, over 21765.00 frames. ], tot_loss[loss=0.2119, simple_loss=0.2823, pruned_loss=0.07079, over 4281200.87 frames. 
], batch size: 441, lr: 4.40e-03, grad_scale: 16.0 2023-06-24 16:33:00,112 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1139202.0, ans=0.1 2023-06-24 16:33:03,118 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.166e+02 2.809e+02 3.309e+02 4.065e+02 7.013e+02, threshold=6.619e+02, percent-clipped=1.0 2023-06-24 16:33:09,256 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-24 16:33:35,961 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=1139322.0, ans=0.1 2023-06-24 16:34:09,367 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=5.32 vs. limit=15.0 2023-06-24 16:34:37,872 INFO [train.py:996] (3/4) Epoch 7, batch 6950, loss[loss=0.2368, simple_loss=0.3163, pruned_loss=0.0787, over 21553.00 frames. ], tot_loss[loss=0.2105, simple_loss=0.2849, pruned_loss=0.06802, over 4284377.62 frames. ], batch size: 211, lr: 4.40e-03, grad_scale: 16.0 2023-06-24 16:34:38,867 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=1139502.0, ans=0.5 2023-06-24 16:34:59,907 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=1139562.0, ans=0.2 2023-06-24 16:35:26,434 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-24 16:35:34,249 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.78 vs. limit=15.0 2023-06-24 16:36:27,743 INFO [train.py:996] (3/4) Epoch 7, batch 7000, loss[loss=0.2152, simple_loss=0.2757, pruned_loss=0.07732, over 21181.00 frames. ], tot_loss[loss=0.2152, simple_loss=0.2884, pruned_loss=0.071, over 4290324.58 frames. 
], batch size: 143, lr: 4.40e-03, grad_scale: 16.0 2023-06-24 16:36:28,741 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1139802.0, ans=0.125 2023-06-24 16:36:49,419 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.743e+02 2.855e+02 3.392e+02 4.148e+02 6.941e+02, threshold=6.785e+02, percent-clipped=1.0 2023-06-24 16:36:57,327 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1139862.0, ans=0.1 2023-06-24 16:36:59,161 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1139862.0, ans=0.1 2023-06-24 16:37:13,281 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1139922.0, ans=0.125 2023-06-24 16:37:49,494 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1139982.0, ans=0.0 2023-06-24 16:37:51,079 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.min_positive, batch_count=1139982.0, ans=0.05 2023-06-24 16:38:13,734 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1140042.0, ans=0.125 2023-06-24 16:38:18,588 INFO [train.py:996] (3/4) Epoch 7, batch 7050, loss[loss=0.1848, simple_loss=0.2713, pruned_loss=0.04916, over 21797.00 frames. ], tot_loss[loss=0.2134, simple_loss=0.2865, pruned_loss=0.07022, over 4282076.36 frames. ], batch size: 282, lr: 4.40e-03, grad_scale: 16.0 2023-06-24 16:38:24,131 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=1140102.0, ans=0.2 2023-06-24 16:38:47,335 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1140162.0, ans=0.0 2023-06-24 16:38:54,725 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=1140162.0, ans=0.125 2023-06-24 16:38:56,730 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=1140162.0, ans=0.05 2023-06-24 16:39:56,444 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=1140342.0, ans=0.125 2023-06-24 16:39:58,678 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.56 vs. limit=10.0 2023-06-24 16:40:15,906 INFO [train.py:996] (3/4) Epoch 7, batch 7100, loss[loss=0.2332, simple_loss=0.3162, pruned_loss=0.07514, over 21786.00 frames. ], tot_loss[loss=0.2182, simple_loss=0.292, pruned_loss=0.07217, over 4282319.68 frames. ], batch size: 124, lr: 4.40e-03, grad_scale: 16.0 2023-06-24 16:40:23,536 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=1140402.0, ans=0.2 2023-06-24 16:40:29,183 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=13.19 vs. 
limit=15.0 2023-06-24 16:40:31,719 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.970e+02 2.792e+02 3.207e+02 3.771e+02 5.994e+02, threshold=6.414e+02, percent-clipped=0.0 2023-06-24 16:40:43,123 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=1140462.0, ans=0.125 2023-06-24 16:41:03,263 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1140522.0, ans=0.125 2023-06-24 16:42:06,511 INFO [train.py:996] (3/4) Epoch 7, batch 7150, loss[loss=0.1549, simple_loss=0.2298, pruned_loss=0.03995, over 21239.00 frames. ], tot_loss[loss=0.2144, simple_loss=0.289, pruned_loss=0.06989, over 4274385.67 frames. ], batch size: 176, lr: 4.40e-03, grad_scale: 16.0 2023-06-24 16:42:37,068 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1140762.0, ans=0.125 2023-06-24 16:43:16,128 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=1140882.0, ans=0.0 2023-06-24 16:43:17,727 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=1140882.0, ans=0.05 2023-06-24 16:43:56,405 INFO [train.py:996] (3/4) Epoch 7, batch 7200, loss[loss=0.2228, simple_loss=0.2913, pruned_loss=0.0771, over 21336.00 frames. ], tot_loss[loss=0.2163, simple_loss=0.2905, pruned_loss=0.07103, over 4272702.18 frames. ], batch size: 131, lr: 4.40e-03, grad_scale: 32.0 2023-06-24 16:44:12,344 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.853e+02 2.840e+02 3.235e+02 4.044e+02 5.731e+02, threshold=6.469e+02, percent-clipped=0.0 2023-06-24 16:44:15,484 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.02 vs. limit=15.0 2023-06-24 16:44:23,640 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1141062.0, ans=0.1 2023-06-24 16:45:15,680 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-24 16:45:35,502 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1141242.0, ans=0.125 2023-06-24 16:45:37,237 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1141242.0, ans=0.125 2023-06-24 16:45:43,823 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=1141302.0, ans=0.0 2023-06-24 16:45:45,334 INFO [train.py:996] (3/4) Epoch 7, batch 7250, loss[loss=0.1928, simple_loss=0.2557, pruned_loss=0.06498, over 21595.00 frames. ], tot_loss[loss=0.2144, simple_loss=0.2868, pruned_loss=0.07102, over 4272105.25 frames. 
], batch size: 247, lr: 4.40e-03, grad_scale: 16.0 2023-06-24 16:45:50,907 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1141302.0, ans=0.125 2023-06-24 16:45:52,833 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1141302.0, ans=0.125 2023-06-24 16:46:21,429 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.62 vs. limit=15.0 2023-06-24 16:46:50,960 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1141422.0, ans=0.1 2023-06-24 16:46:55,111 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.38 vs. limit=10.0 2023-06-24 16:47:28,574 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.04 vs. limit=12.0 2023-06-24 16:47:34,241 INFO [train.py:996] (3/4) Epoch 7, batch 7300, loss[loss=0.1807, simple_loss=0.2406, pruned_loss=0.06036, over 21429.00 frames. ], tot_loss[loss=0.2106, simple_loss=0.2806, pruned_loss=0.07034, over 4272908.44 frames. ], batch size: 212, lr: 4.40e-03, grad_scale: 16.0 2023-06-24 16:47:45,455 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.44 vs. limit=22.5 2023-06-24 16:47:51,209 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.050e+02 2.579e+02 3.088e+02 3.610e+02 6.583e+02, threshold=6.177e+02, percent-clipped=0.0 2023-06-24 16:47:59,195 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1141662.0, ans=0.1 2023-06-24 16:48:41,930 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=1141782.0, ans=0.2 2023-06-24 16:49:25,130 INFO [train.py:996] (3/4) Epoch 7, batch 7350, loss[loss=0.2307, simple_loss=0.2903, pruned_loss=0.08557, over 21249.00 frames. ], tot_loss[loss=0.2093, simple_loss=0.2785, pruned_loss=0.07008, over 4270897.93 frames. ], batch size: 159, lr: 4.40e-03, grad_scale: 16.0 2023-06-24 16:50:18,426 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=1142022.0, ans=0.5 2023-06-24 16:50:51,192 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=1142082.0, ans=0.0 2023-06-24 16:51:07,568 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1142142.0, ans=0.125 2023-06-24 16:51:11,763 INFO [train.py:996] (3/4) Epoch 7, batch 7400, loss[loss=0.2488, simple_loss=0.3248, pruned_loss=0.0864, over 21620.00 frames. ], tot_loss[loss=0.2152, simple_loss=0.2855, pruned_loss=0.07242, over 4273170.20 frames. 
], batch size: 389, lr: 4.40e-03, grad_scale: 16.0 2023-06-24 16:51:14,131 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1142202.0, ans=0.1 2023-06-24 16:51:41,593 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.150e+02 2.851e+02 3.315e+02 4.181e+02 6.542e+02, threshold=6.630e+02, percent-clipped=3.0 2023-06-24 16:52:06,488 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1142322.0, ans=0.0 2023-06-24 16:52:08,322 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1142322.0, ans=0.125 2023-06-24 16:52:15,230 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=1142322.0, ans=0.5 2023-06-24 16:52:19,015 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1142322.0, ans=0.1 2023-06-24 16:52:31,588 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1142382.0, ans=0.1 2023-06-24 16:52:52,227 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.53 vs. limit=10.0 2023-06-24 16:53:03,469 INFO [train.py:996] (3/4) Epoch 7, batch 7450, loss[loss=0.2185, simple_loss=0.2954, pruned_loss=0.07082, over 21532.00 frames. ], tot_loss[loss=0.2138, simple_loss=0.285, pruned_loss=0.07136, over 4278257.07 frames. ], batch size: 441, lr: 4.40e-03, grad_scale: 16.0 2023-06-24 16:53:04,017 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1142502.0, ans=0.1 2023-06-24 16:53:28,519 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.88 vs. limit=15.0 2023-06-24 16:53:43,693 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1142562.0, ans=0.0 2023-06-24 16:54:27,677 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.64 vs. limit=10.0 2023-06-24 16:54:58,752 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.63 vs. limit=15.0 2023-06-24 16:55:06,450 INFO [train.py:996] (3/4) Epoch 7, batch 7500, loss[loss=0.3139, simple_loss=0.3979, pruned_loss=0.115, over 21475.00 frames. ], tot_loss[loss=0.2187, simple_loss=0.2905, pruned_loss=0.07347, over 4282855.22 frames. 
], batch size: 471, lr: 4.40e-03, grad_scale: 16.0 2023-06-24 16:55:29,612 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.290e+02 3.030e+02 3.534e+02 4.560e+02 9.672e+02, threshold=7.067e+02, percent-clipped=6.0 2023-06-24 16:56:54,293 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=1143042.0, ans=0.0 2023-06-24 16:56:55,875 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=1143102.0, ans=0.04949747468305833 2023-06-24 16:56:56,942 INFO [train.py:996] (3/4) Epoch 7, batch 7550, loss[loss=0.2122, simple_loss=0.3126, pruned_loss=0.05592, over 21765.00 frames. ], tot_loss[loss=0.2213, simple_loss=0.2984, pruned_loss=0.07207, over 4283250.64 frames. ], batch size: 351, lr: 4.40e-03, grad_scale: 16.0 2023-06-24 16:57:34,128 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=1143162.0, ans=0.5 2023-06-24 16:58:41,040 INFO [train.py:996] (3/4) Epoch 7, batch 7600, loss[loss=0.2381, simple_loss=0.3074, pruned_loss=0.08442, over 21778.00 frames. ], tot_loss[loss=0.2193, simple_loss=0.2969, pruned_loss=0.07087, over 4288790.75 frames. ], batch size: 441, lr: 4.39e-03, grad_scale: 32.0 2023-06-24 16:58:56,054 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1143402.0, ans=0.125 2023-06-24 16:59:09,488 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.047e+02 2.834e+02 3.229e+02 4.103e+02 6.859e+02, threshold=6.458e+02, percent-clipped=0.0 2023-06-24 16:59:10,063 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1143462.0, ans=0.125 2023-06-24 16:59:59,400 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=9.30 vs. limit=15.0 2023-06-24 17:00:00,954 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1143582.0, ans=0.0 2023-06-24 17:00:36,330 INFO [train.py:996] (3/4) Epoch 7, batch 7650, loss[loss=0.2565, simple_loss=0.3249, pruned_loss=0.09402, over 21866.00 frames. ], tot_loss[loss=0.2198, simple_loss=0.2954, pruned_loss=0.07211, over 4294943.13 frames. ], batch size: 107, lr: 4.39e-03, grad_scale: 32.0 2023-06-24 17:00:57,949 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=1143762.0, ans=0.125 2023-06-24 17:01:20,512 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1143762.0, ans=0.125 2023-06-24 17:01:24,191 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=1143822.0, ans=0.0 2023-06-24 17:02:28,489 INFO [train.py:996] (3/4) Epoch 7, batch 7700, loss[loss=0.2678, simple_loss=0.3451, pruned_loss=0.09529, over 21493.00 frames. ], tot_loss[loss=0.2255, simple_loss=0.3002, pruned_loss=0.07545, over 4289931.07 frames. 
], batch size: 131, lr: 4.39e-03, grad_scale: 16.0 2023-06-24 17:02:53,672 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.223e+02 2.786e+02 3.159e+02 3.961e+02 6.423e+02, threshold=6.319e+02, percent-clipped=0.0 2023-06-24 17:03:38,085 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.27 vs. limit=15.0 2023-06-24 17:03:44,448 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=1144182.0, ans=0.125 2023-06-24 17:04:28,096 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1144302.0, ans=0.0 2023-06-24 17:04:29,076 INFO [train.py:996] (3/4) Epoch 7, batch 7750, loss[loss=0.2446, simple_loss=0.3207, pruned_loss=0.08422, over 21262.00 frames. ], tot_loss[loss=0.2294, simple_loss=0.3064, pruned_loss=0.07617, over 4285162.34 frames. ], batch size: 143, lr: 4.39e-03, grad_scale: 16.0 2023-06-24 17:06:09,616 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.75 vs. limit=15.0 2023-06-24 17:06:27,892 INFO [train.py:996] (3/4) Epoch 7, batch 7800, loss[loss=0.1937, simple_loss=0.2647, pruned_loss=0.06136, over 21466.00 frames. ], tot_loss[loss=0.2309, simple_loss=0.3079, pruned_loss=0.07696, over 4284580.94 frames. ], batch size: 212, lr: 4.39e-03, grad_scale: 16.0 2023-06-24 17:06:47,324 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.299e+02 3.311e+02 4.032e+02 5.871e+02 9.097e+02, threshold=8.064e+02, percent-clipped=12.0 2023-06-24 17:06:48,546 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.63 vs. limit=15.0 2023-06-24 17:07:08,952 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1144722.0, ans=0.0 2023-06-24 17:07:27,630 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1144782.0, ans=0.125 2023-06-24 17:08:11,744 INFO [train.py:996] (3/4) Epoch 7, batch 7850, loss[loss=0.1938, simple_loss=0.2606, pruned_loss=0.06357, over 21540.00 frames. ], tot_loss[loss=0.226, simple_loss=0.2998, pruned_loss=0.0761, over 4278367.24 frames. ], batch size: 212, lr: 4.39e-03, grad_scale: 16.0 2023-06-24 17:09:18,658 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.07 vs. limit=15.0 2023-06-24 17:09:45,541 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1145142.0, ans=0.125 2023-06-24 17:10:10,664 INFO [train.py:996] (3/4) Epoch 7, batch 7900, loss[loss=0.216, simple_loss=0.2907, pruned_loss=0.07065, over 21255.00 frames. ], tot_loss[loss=0.2232, simple_loss=0.2955, pruned_loss=0.07545, over 4282799.56 frames. 
], batch size: 548, lr: 4.39e-03, grad_scale: 16.0 2023-06-24 17:10:16,840 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=1145202.0, ans=0.0 2023-06-24 17:10:20,288 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1145202.0, ans=0.125 2023-06-24 17:10:24,151 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1145202.0, ans=0.1 2023-06-24 17:10:28,104 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1145262.0, ans=0.125 2023-06-24 17:10:30,881 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.267e+02 2.893e+02 3.310e+02 4.075e+02 8.177e+02, threshold=6.621e+02, percent-clipped=1.0 2023-06-24 17:11:40,372 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1145382.0, ans=0.125 2023-06-24 17:11:44,140 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=1145442.0, ans=0.125 2023-06-24 17:12:02,893 INFO [train.py:996] (3/4) Epoch 7, batch 7950, loss[loss=0.1992, simple_loss=0.329, pruned_loss=0.03473, over 19771.00 frames. ], tot_loss[loss=0.224, simple_loss=0.2997, pruned_loss=0.07414, over 4268247.13 frames. ], batch size: 702, lr: 4.39e-03, grad_scale: 16.0 2023-06-24 17:13:52,783 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=7.37 vs. limit=15.0 2023-06-24 17:13:54,791 INFO [train.py:996] (3/4) Epoch 7, batch 8000, loss[loss=0.2585, simple_loss=0.3297, pruned_loss=0.09363, over 21273.00 frames. ], tot_loss[loss=0.2279, simple_loss=0.3038, pruned_loss=0.07602, over 4266356.46 frames. ], batch size: 143, lr: 4.39e-03, grad_scale: 32.0 2023-06-24 17:14:21,385 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1145862.0, ans=0.1 2023-06-24 17:14:22,381 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.090e+02 2.777e+02 3.258e+02 3.899e+02 6.990e+02, threshold=6.515e+02, percent-clipped=3.0 2023-06-24 17:15:15,690 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten.whitening_limit, batch_count=1145982.0, ans=15.0 2023-06-24 17:15:57,252 INFO [train.py:996] (3/4) Epoch 7, batch 8050, loss[loss=0.183, simple_loss=0.2361, pruned_loss=0.06495, over 21722.00 frames. ], tot_loss[loss=0.2287, simple_loss=0.305, pruned_loss=0.07614, over 4259297.90 frames. ], batch size: 124, lr: 4.39e-03, grad_scale: 16.0 2023-06-24 17:16:25,347 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=3.38 vs. limit=12.0 2023-06-24 17:17:33,284 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=1146342.0, ans=0.125 2023-06-24 17:17:45,815 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.80 vs. limit=15.0 2023-06-24 17:17:48,329 INFO [train.py:996] (3/4) Epoch 7, batch 8100, loss[loss=0.2286, simple_loss=0.3033, pruned_loss=0.07696, over 21530.00 frames. 
], tot_loss[loss=0.2296, simple_loss=0.3052, pruned_loss=0.07702, over 4270434.33 frames. ], batch size: 131, lr: 4.39e-03, grad_scale: 16.0 2023-06-24 17:18:20,617 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1146462.0, ans=0.0 2023-06-24 17:18:21,655 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.184e+02 3.042e+02 3.840e+02 5.397e+02 9.623e+02, threshold=7.680e+02, percent-clipped=13.0 2023-06-24 17:18:48,924 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1146522.0, ans=0.1 2023-06-24 17:19:40,749 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1146642.0, ans=0.125 2023-06-24 17:19:53,015 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.12 vs. limit=22.5 2023-06-24 17:19:55,076 INFO [train.py:996] (3/4) Epoch 7, batch 8150, loss[loss=0.2579, simple_loss=0.3591, pruned_loss=0.07833, over 21633.00 frames. ], tot_loss[loss=0.2315, simple_loss=0.3082, pruned_loss=0.07738, over 4268749.64 frames. ], batch size: 414, lr: 4.39e-03, grad_scale: 16.0 2023-06-24 17:20:00,761 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1146702.0, ans=0.1 2023-06-24 17:20:08,762 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.70 vs. limit=15.0 2023-06-24 17:20:10,341 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1146702.0, ans=0.0 2023-06-24 17:20:14,987 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1146762.0, ans=0.125 2023-06-24 17:21:09,465 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=1146882.0, ans=0.0 2023-06-24 17:21:44,381 INFO [train.py:996] (3/4) Epoch 7, batch 8200, loss[loss=0.2105, simple_loss=0.2767, pruned_loss=0.07214, over 21887.00 frames. ], tot_loss[loss=0.2245, simple_loss=0.3006, pruned_loss=0.07418, over 4259657.11 frames. ], batch size: 373, lr: 4.39e-03, grad_scale: 16.0 2023-06-24 17:22:06,168 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.043e+02 2.961e+02 3.959e+02 5.617e+02 1.113e+03, threshold=7.919e+02, percent-clipped=3.0 2023-06-24 17:22:12,257 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=9.56 vs. limit=15.0 2023-06-24 17:22:15,049 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1147062.0, ans=0.0 2023-06-24 17:22:25,531 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=1147122.0, ans=0.125 2023-06-24 17:22:51,201 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1147182.0, ans=0.1 2023-06-24 17:23:29,572 INFO [train.py:996] (3/4) Epoch 7, batch 8250, loss[loss=0.2191, simple_loss=0.2972, pruned_loss=0.07055, over 21238.00 frames. ], tot_loss[loss=0.2257, simple_loss=0.3012, pruned_loss=0.07508, over 4259811.28 frames. 
], batch size: 143, lr: 4.39e-03, grad_scale: 16.0 2023-06-24 17:23:55,647 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.78 vs. limit=12.0 2023-06-24 17:24:29,540 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1147482.0, ans=0.125 2023-06-24 17:24:30,092 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.64 vs. limit=12.0 2023-06-24 17:25:22,778 INFO [train.py:996] (3/4) Epoch 7, batch 8300, loss[loss=0.1732, simple_loss=0.2509, pruned_loss=0.0477, over 21756.00 frames. ], tot_loss[loss=0.2223, simple_loss=0.3002, pruned_loss=0.07222, over 4265484.07 frames. ], batch size: 124, lr: 4.39e-03, grad_scale: 16.0 2023-06-24 17:25:43,619 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.968e+02 2.710e+02 3.107e+02 3.703e+02 5.803e+02, threshold=6.215e+02, percent-clipped=0.0 2023-06-24 17:26:32,346 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1147782.0, ans=0.125 2023-06-24 17:26:42,860 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=1147782.0, ans=0.125 2023-06-24 17:27:12,191 INFO [train.py:996] (3/4) Epoch 7, batch 8350, loss[loss=0.2112, simple_loss=0.3037, pruned_loss=0.05936, over 21713.00 frames. ], tot_loss[loss=0.2215, simple_loss=0.3007, pruned_loss=0.07114, over 4274570.44 frames. ], batch size: 351, lr: 4.39e-03, grad_scale: 16.0 2023-06-24 17:27:20,110 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=1147902.0, ans=0.125 2023-06-24 17:27:34,233 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1147962.0, ans=0.125 2023-06-24 17:27:38,121 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-24 17:27:41,743 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=1147962.0, ans=0.2 2023-06-24 17:28:23,482 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.83 vs. limit=10.0 2023-06-24 17:29:03,704 INFO [train.py:996] (3/4) Epoch 7, batch 8400, loss[loss=0.2045, simple_loss=0.3, pruned_loss=0.05444, over 21442.00 frames. ], tot_loss[loss=0.2173, simple_loss=0.2981, pruned_loss=0.06825, over 4273314.51 frames. ], batch size: 471, lr: 4.39e-03, grad_scale: 32.0 2023-06-24 17:29:25,438 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.909e+02 2.527e+02 3.220e+02 3.909e+02 1.035e+03, threshold=6.440e+02, percent-clipped=5.0 2023-06-24 17:29:44,495 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1148322.0, ans=0.0 2023-06-24 17:29:58,271 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=5.19 vs. 
limit=12.0 2023-06-24 17:30:01,483 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=1148382.0, ans=0.2 2023-06-24 17:30:13,450 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1148382.0, ans=0.125 2023-06-24 17:30:29,269 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1148442.0, ans=0.125 2023-06-24 17:30:47,862 INFO [train.py:996] (3/4) Epoch 7, batch 8450, loss[loss=0.2136, simple_loss=0.2822, pruned_loss=0.07254, over 21808.00 frames. ], tot_loss[loss=0.2151, simple_loss=0.2961, pruned_loss=0.06702, over 4275935.09 frames. ], batch size: 371, lr: 4.39e-03, grad_scale: 32.0 2023-06-24 17:32:03,635 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=1148682.0, ans=0.125 2023-06-24 17:32:32,091 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.17 vs. limit=10.0 2023-06-24 17:32:36,618 INFO [train.py:996] (3/4) Epoch 7, batch 8500, loss[loss=0.1913, simple_loss=0.2559, pruned_loss=0.06335, over 21402.00 frames. ], tot_loss[loss=0.2144, simple_loss=0.2922, pruned_loss=0.06826, over 4280212.77 frames. ], batch size: 211, lr: 4.38e-03, grad_scale: 32.0 2023-06-24 17:32:47,467 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1148802.0, ans=0.125 2023-06-24 17:32:57,144 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.311e+02 2.839e+02 3.413e+02 4.005e+02 7.078e+02, threshold=6.826e+02, percent-clipped=2.0 2023-06-24 17:33:07,758 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=18.83 vs. limit=22.5 2023-06-24 17:33:21,795 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=12.63 vs. limit=22.5 2023-06-24 17:34:22,825 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.55 vs. limit=22.5 2023-06-24 17:34:26,833 INFO [train.py:996] (3/4) Epoch 7, batch 8550, loss[loss=0.2238, simple_loss=0.3014, pruned_loss=0.07308, over 21625.00 frames. ], tot_loss[loss=0.2207, simple_loss=0.2972, pruned_loss=0.07213, over 4277532.89 frames. ], batch size: 263, lr: 4.38e-03, grad_scale: 32.0 2023-06-24 17:34:30,906 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1149102.0, ans=0.125 2023-06-24 17:34:34,846 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-24 17:35:10,904 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer_na.min_abs, batch_count=1149222.0, ans=0.02 2023-06-24 17:35:35,593 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.93 vs. limit=22.5 2023-06-24 17:35:58,948 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=2.74 vs. 
limit=12.0 2023-06-24 17:36:02,088 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1149342.0, ans=0.125 2023-06-24 17:36:13,608 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1149342.0, ans=0.125 2023-06-24 17:36:18,068 INFO [train.py:996] (3/4) Epoch 7, batch 8600, loss[loss=0.2669, simple_loss=0.3375, pruned_loss=0.09822, over 21790.00 frames. ], tot_loss[loss=0.2267, simple_loss=0.3047, pruned_loss=0.07433, over 4283752.38 frames. ], batch size: 118, lr: 4.38e-03, grad_scale: 32.0 2023-06-24 17:36:40,532 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.213e+02 3.018e+02 3.698e+02 4.926e+02 7.683e+02, threshold=7.396e+02, percent-clipped=5.0 2023-06-24 17:36:54,112 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1149462.0, ans=0.1 2023-06-24 17:37:06,069 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1149522.0, ans=0.0 2023-06-24 17:37:29,130 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1149582.0, ans=0.125 2023-06-24 17:37:58,815 INFO [train.py:996] (3/4) Epoch 7, batch 8650, loss[loss=0.1922, simple_loss=0.2954, pruned_loss=0.0445, over 21793.00 frames. ], tot_loss[loss=0.2294, simple_loss=0.3096, pruned_loss=0.07459, over 4274826.02 frames. ], batch size: 332, lr: 4.38e-03, grad_scale: 32.0 2023-06-24 17:38:01,223 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=1149702.0, ans=0.07 2023-06-24 17:38:21,304 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=1149762.0, ans=0.2 2023-06-24 17:39:10,336 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=1149882.0, ans=10.0 2023-06-24 17:39:22,579 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1149882.0, ans=0.0 2023-06-24 17:39:41,488 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1150002.0, ans=0.1 2023-06-24 17:39:42,717 INFO [train.py:996] (3/4) Epoch 7, batch 8700, loss[loss=0.1844, simple_loss=0.256, pruned_loss=0.05636, over 21616.00 frames. ], tot_loss[loss=0.2226, simple_loss=0.3028, pruned_loss=0.07122, over 4278522.89 frames. ], batch size: 298, lr: 4.38e-03, grad_scale: 16.0 2023-06-24 17:39:54,924 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=1150002.0, ans=0.2 2023-06-24 17:40:08,805 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1150062.0, ans=0.125 2023-06-24 17:40:09,729 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.692e+02 2.588e+02 3.028e+02 3.644e+02 6.697e+02, threshold=6.057e+02, percent-clipped=0.0 2023-06-24 17:41:07,193 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.59 vs. 
limit=15.0 2023-06-24 17:41:10,225 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=1150182.0, ans=0.0 2023-06-24 17:41:30,715 INFO [train.py:996] (3/4) Epoch 7, batch 8750, loss[loss=0.2308, simple_loss=0.3537, pruned_loss=0.05392, over 19903.00 frames. ], tot_loss[loss=0.2215, simple_loss=0.3003, pruned_loss=0.07134, over 4269634.24 frames. ], batch size: 702, lr: 4.38e-03, grad_scale: 16.0 2023-06-24 17:41:53,039 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=1150362.0, ans=0.0 2023-06-24 17:42:20,882 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=6.15 vs. limit=10.0 2023-06-24 17:42:26,033 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten.whitening_limit, batch_count=1150422.0, ans=15.0 2023-06-24 17:42:40,067 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=1150422.0, ans=0.125 2023-06-24 17:43:22,464 INFO [train.py:996] (3/4) Epoch 7, batch 8800, loss[loss=0.2656, simple_loss=0.3794, pruned_loss=0.07593, over 21244.00 frames. ], tot_loss[loss=0.2285, simple_loss=0.3084, pruned_loss=0.07431, over 4274180.84 frames. ], batch size: 548, lr: 4.38e-03, grad_scale: 32.0 2023-06-24 17:43:36,239 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1150602.0, ans=0.1 2023-06-24 17:44:01,443 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=1150662.0, ans=0.2 2023-06-24 17:44:02,354 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.260e+02 3.059e+02 3.780e+02 4.742e+02 8.855e+02, threshold=7.560e+02, percent-clipped=10.0 2023-06-24 17:44:04,899 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1150662.0, ans=0.125 2023-06-24 17:44:35,699 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1150722.0, ans=0.125 2023-06-24 17:44:35,861 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1150722.0, ans=0.1 2023-06-24 17:45:10,360 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=1150842.0, ans=0.05 2023-06-24 17:45:24,891 INFO [train.py:996] (3/4) Epoch 7, batch 8850, loss[loss=0.2364, simple_loss=0.3251, pruned_loss=0.07382, over 21629.00 frames. ], tot_loss[loss=0.2343, simple_loss=0.3138, pruned_loss=0.07737, over 4272650.19 frames. 
], batch size: 414, lr: 4.38e-03, grad_scale: 32.0 2023-06-24 17:45:25,373 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1150902.0, ans=0.1 2023-06-24 17:46:04,044 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1150962.0, ans=0.0 2023-06-24 17:46:27,259 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=1151082.0, ans=0.2 2023-06-24 17:47:16,901 INFO [train.py:996] (3/4) Epoch 7, batch 8900, loss[loss=0.2634, simple_loss=0.3322, pruned_loss=0.09728, over 21402.00 frames. ], tot_loss[loss=0.2308, simple_loss=0.3086, pruned_loss=0.07648, over 4266809.31 frames. ], batch size: 507, lr: 4.38e-03, grad_scale: 16.0 2023-06-24 17:47:41,847 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=1151202.0, ans=0.125 2023-06-24 17:47:52,772 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1151262.0, ans=0.0 2023-06-24 17:47:52,798 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=1151262.0, ans=0.125 2023-06-24 17:47:54,310 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.186e+02 2.946e+02 3.604e+02 5.046e+02 1.118e+03, threshold=7.207e+02, percent-clipped=3.0 2023-06-24 17:48:02,670 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1151262.0, ans=0.125 2023-06-24 17:48:07,698 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=1151322.0, ans=0.125 2023-06-24 17:48:11,747 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=1151322.0, ans=0.2 2023-06-24 17:48:13,646 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1151322.0, ans=0.125 2023-06-24 17:48:20,736 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=1151322.0, ans=0.2 2023-06-24 17:48:26,020 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1151382.0, ans=0.0 2023-06-24 17:48:33,499 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=1151382.0, ans=0.2 2023-06-24 17:48:35,056 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1151382.0, ans=0.125 2023-06-24 17:49:21,203 INFO [train.py:996] (3/4) Epoch 7, batch 8950, loss[loss=0.2631, simple_loss=0.3482, pruned_loss=0.089, over 21654.00 frames. ], tot_loss[loss=0.2287, simple_loss=0.3074, pruned_loss=0.07503, over 4265004.89 frames. ], batch size: 389, lr: 4.38e-03, grad_scale: 16.0 2023-06-24 17:51:09,873 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.41 vs. limit=22.5 2023-06-24 17:51:10,367 INFO [train.py:996] (3/4) Epoch 7, batch 9000, loss[loss=0.2013, simple_loss=0.2651, pruned_loss=0.06875, over 21645.00 frames. ], tot_loss[loss=0.2256, simple_loss=0.3027, pruned_loss=0.0743, over 4261050.94 frames. 
], batch size: 332, lr: 4.38e-03, grad_scale: 16.0 2023-06-24 17:51:10,367 INFO [train.py:1019] (3/4) Computing validation loss 2023-06-24 17:51:28,285 INFO [train.py:1028] (3/4) Epoch 7, validation: loss=0.2657, simple_loss=0.3576, pruned_loss=0.0869, over 1796401.00 frames. 2023-06-24 17:51:28,286 INFO [train.py:1029] (3/4) Maximum memory allocated so far is 23409MB 2023-06-24 17:51:53,793 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.127e+02 2.929e+02 3.694e+02 4.955e+02 7.799e+02, threshold=7.388e+02, percent-clipped=3.0 2023-06-24 17:52:05,933 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.89 vs. limit=10.0 2023-06-24 17:53:11,717 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=1152042.0, ans=0.2 2023-06-24 17:53:14,994 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-24 17:53:21,720 INFO [train.py:996] (3/4) Epoch 7, batch 9050, loss[loss=0.2199, simple_loss=0.2955, pruned_loss=0.0722, over 21391.00 frames. ], tot_loss[loss=0.22, simple_loss=0.297, pruned_loss=0.07154, over 4269387.49 frames. ], batch size: 211, lr: 4.38e-03, grad_scale: 16.0 2023-06-24 17:53:29,775 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=1152102.0, ans=10.0 2023-06-24 17:54:59,268 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1152342.0, ans=0.0 2023-06-24 17:55:02,617 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1152342.0, ans=0.1 2023-06-24 17:55:14,791 INFO [train.py:996] (3/4) Epoch 7, batch 9100, loss[loss=0.1909, simple_loss=0.2476, pruned_loss=0.06709, over 19946.00 frames. ], tot_loss[loss=0.2229, simple_loss=0.3006, pruned_loss=0.07265, over 4270085.38 frames. ], batch size: 702, lr: 4.38e-03, grad_scale: 16.0 2023-06-24 17:55:22,330 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.min_positive, batch_count=1152402.0, ans=0.025 2023-06-24 17:55:45,275 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.881e+02 2.655e+02 3.193e+02 3.861e+02 6.275e+02, threshold=6.386e+02, percent-clipped=0.0 2023-06-24 17:55:53,059 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1152462.0, ans=0.125 2023-06-24 17:55:55,130 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=12.23 vs. limit=22.5 2023-06-24 17:56:11,043 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=1152522.0, ans=0.2 2023-06-24 17:56:54,808 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1152642.0, ans=0.125 2023-06-24 17:56:59,952 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1152702.0, ans=0.1 2023-06-24 17:57:01,006 INFO [train.py:996] (3/4) Epoch 7, batch 9150, loss[loss=0.1931, simple_loss=0.3149, pruned_loss=0.03563, over 20688.00 frames. 
], tot_loss[loss=0.2218, simple_loss=0.3033, pruned_loss=0.07012, over 4268962.26 frames. ], batch size: 607, lr: 4.38e-03, grad_scale: 16.0 2023-06-24 17:57:14,125 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1152702.0, ans=0.0 2023-06-24 17:58:11,264 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1152822.0, ans=0.125 2023-06-24 17:58:26,912 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1152882.0, ans=0.125 2023-06-24 17:58:29,109 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-24 17:58:32,529 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=1152882.0, ans=0.0 2023-06-24 17:58:58,732 INFO [train.py:996] (3/4) Epoch 7, batch 9200, loss[loss=0.1974, simple_loss=0.2959, pruned_loss=0.0495, over 21606.00 frames. ], tot_loss[loss=0.2211, simple_loss=0.3053, pruned_loss=0.06847, over 4264572.25 frames. ], batch size: 263, lr: 4.38e-03, grad_scale: 32.0 2023-06-24 17:59:17,669 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1153062.0, ans=0.125 2023-06-24 17:59:29,536 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.106e+02 2.740e+02 3.426e+02 4.320e+02 8.569e+02, threshold=6.853e+02, percent-clipped=6.0 2023-06-24 17:59:45,698 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1153122.0, ans=0.125 2023-06-24 18:00:23,841 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.50 vs. limit=22.5 2023-06-24 18:00:39,519 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.12 vs. limit=15.0 2023-06-24 18:00:50,653 INFO [train.py:996] (3/4) Epoch 7, batch 9250, loss[loss=0.2204, simple_loss=0.2793, pruned_loss=0.08073, over 21456.00 frames. ], tot_loss[loss=0.226, simple_loss=0.308, pruned_loss=0.072, over 4271230.93 frames. ], batch size: 389, lr: 4.38e-03, grad_scale: 32.0 2023-06-24 18:00:51,109 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1153302.0, ans=0.125 2023-06-24 18:02:12,993 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1153482.0, ans=0.125 2023-06-24 18:02:39,932 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1153542.0, ans=0.0 2023-06-24 18:02:42,901 INFO [train.py:996] (3/4) Epoch 7, batch 9300, loss[loss=0.1921, simple_loss=0.2633, pruned_loss=0.06047, over 21802.00 frames. ], tot_loss[loss=0.2238, simple_loss=0.3022, pruned_loss=0.07272, over 4261737.87 frames. 
], batch size: 118, lr: 4.38e-03, grad_scale: 32.0 2023-06-24 18:02:46,982 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1153602.0, ans=0.125 2023-06-24 18:03:13,909 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.140e+02 3.058e+02 3.549e+02 4.364e+02 7.419e+02, threshold=7.098e+02, percent-clipped=2.0 2023-06-24 18:03:24,863 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=1153662.0, ans=0.125 2023-06-24 18:03:27,246 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.94 vs. limit=6.0 2023-06-24 18:03:31,696 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=1153722.0, ans=0.0 2023-06-24 18:03:33,744 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1153722.0, ans=0.125 2023-06-24 18:03:43,154 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=13.58 vs. limit=22.5 2023-06-24 18:04:29,049 INFO [train.py:996] (3/4) Epoch 7, batch 9350, loss[loss=0.2409, simple_loss=0.3202, pruned_loss=0.0808, over 21823.00 frames. ], tot_loss[loss=0.227, simple_loss=0.3072, pruned_loss=0.07337, over 4256994.69 frames. ], batch size: 282, lr: 4.37e-03, grad_scale: 32.0 2023-06-24 18:04:45,030 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1153902.0, ans=0.125 2023-06-24 18:05:00,974 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1153902.0, ans=0.125 2023-06-24 18:05:08,320 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1153962.0, ans=0.0 2023-06-24 18:05:52,836 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1154082.0, ans=0.0 2023-06-24 18:06:04,617 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=1154142.0, ans=0.0 2023-06-24 18:06:31,723 INFO [train.py:996] (3/4) Epoch 7, batch 9400, loss[loss=0.1874, simple_loss=0.2581, pruned_loss=0.05832, over 21405.00 frames. ], tot_loss[loss=0.2294, simple_loss=0.3094, pruned_loss=0.07472, over 4263223.09 frames. 
], batch size: 211, lr: 4.37e-03, grad_scale: 32.0 2023-06-24 18:06:53,770 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1154262.0, ans=0.125 2023-06-24 18:07:01,729 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten.whitening_limit, batch_count=1154262.0, ans=15.0 2023-06-24 18:07:02,152 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.141e+02 2.873e+02 3.280e+02 3.858e+02 8.681e+02, threshold=6.561e+02, percent-clipped=2.0 2023-06-24 18:08:13,598 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1154442.0, ans=0.125 2023-06-24 18:08:21,784 INFO [train.py:996] (3/4) Epoch 7, batch 9450, loss[loss=0.2198, simple_loss=0.2792, pruned_loss=0.08024, over 21639.00 frames. ], tot_loss[loss=0.2253, simple_loss=0.3026, pruned_loss=0.07399, over 4266119.16 frames. ], batch size: 333, lr: 4.37e-03, grad_scale: 16.0 2023-06-24 18:09:06,770 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1154622.0, ans=0.125 2023-06-24 18:10:10,152 INFO [train.py:996] (3/4) Epoch 7, batch 9500, loss[loss=0.1946, simple_loss=0.2777, pruned_loss=0.05572, over 21396.00 frames. ], tot_loss[loss=0.2199, simple_loss=0.295, pruned_loss=0.07241, over 4257153.20 frames. ], batch size: 194, lr: 4.37e-03, grad_scale: 16.0 2023-06-24 18:10:24,710 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=1154802.0, ans=0.125 2023-06-24 18:10:42,228 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.100e+02 2.886e+02 3.476e+02 4.165e+02 8.781e+02, threshold=6.953e+02, percent-clipped=4.0 2023-06-24 18:11:59,048 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.02 vs. limit=22.5 2023-06-24 18:12:00,973 INFO [train.py:996] (3/4) Epoch 7, batch 9550, loss[loss=0.253, simple_loss=0.3363, pruned_loss=0.08488, over 21759.00 frames. ], tot_loss[loss=0.2243, simple_loss=0.3001, pruned_loss=0.07426, over 4258655.99 frames. ], batch size: 124, lr: 4.37e-03, grad_scale: 16.0 2023-06-24 18:12:11,854 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=1155102.0, ans=0.05 2023-06-24 18:12:22,705 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1155162.0, ans=0.1 2023-06-24 18:12:28,222 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=1155162.0, ans=0.07 2023-06-24 18:12:31,551 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1155162.0, ans=0.125 2023-06-24 18:12:37,976 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-24 18:13:50,404 INFO [train.py:996] (3/4) Epoch 7, batch 9600, loss[loss=0.1893, simple_loss=0.2707, pruned_loss=0.05395, over 21830.00 frames. ], tot_loss[loss=0.2271, simple_loss=0.3033, pruned_loss=0.07545, over 4265840.20 frames. 
], batch size: 298, lr: 4.37e-03, grad_scale: 32.0 2023-06-24 18:14:23,109 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.186e+02 3.053e+02 3.563e+02 4.666e+02 8.626e+02, threshold=7.126e+02, percent-clipped=5.0 2023-06-24 18:14:35,015 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1155522.0, ans=0.125 2023-06-24 18:14:48,851 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=1155522.0, ans=0.5 2023-06-24 18:15:45,050 INFO [train.py:996] (3/4) Epoch 7, batch 9650, loss[loss=0.2264, simple_loss=0.3004, pruned_loss=0.0762, over 21925.00 frames. ], tot_loss[loss=0.2246, simple_loss=0.302, pruned_loss=0.07364, over 4271547.01 frames. ], batch size: 316, lr: 4.37e-03, grad_scale: 16.0 2023-06-24 18:15:56,054 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1155702.0, ans=0.125 2023-06-24 18:16:41,048 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.38 vs. limit=22.5 2023-06-24 18:16:58,855 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer_na.min_abs, batch_count=1155882.0, ans=0.02 2023-06-24 18:17:28,154 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=1155942.0, ans=0.2 2023-06-24 18:17:34,789 INFO [train.py:996] (3/4) Epoch 7, batch 9700, loss[loss=0.223, simple_loss=0.2974, pruned_loss=0.07431, over 21508.00 frames. ], tot_loss[loss=0.2273, simple_loss=0.3062, pruned_loss=0.0742, over 4271683.30 frames. ], batch size: 548, lr: 4.37e-03, grad_scale: 16.0 2023-06-24 18:18:05,184 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=1156062.0, ans=0.04949747468305833 2023-06-24 18:18:08,278 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.126e+02 2.706e+02 3.025e+02 3.673e+02 7.479e+02, threshold=6.049e+02, percent-clipped=1.0 2023-06-24 18:18:24,829 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1156122.0, ans=0.125 2023-06-24 18:18:35,504 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=1156122.0, ans=0.125 2023-06-24 18:19:18,082 INFO [train.py:996] (3/4) Epoch 7, batch 9750, loss[loss=0.2491, simple_loss=0.2891, pruned_loss=0.1045, over 21430.00 frames. ], tot_loss[loss=0.2237, simple_loss=0.3002, pruned_loss=0.07364, over 4275373.48 frames. 
], batch size: 509, lr: 4.37e-03, grad_scale: 16.0 2023-06-24 18:19:50,084 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=1156362.0, ans=0.0 2023-06-24 18:20:02,902 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=1156422.0, ans=0.125 2023-06-24 18:20:13,299 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=1156422.0, ans=0.125 2023-06-24 18:20:38,595 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=1156542.0, ans=0.2 2023-06-24 18:20:41,751 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=1156542.0, ans=0.125 2023-06-24 18:21:03,211 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1156542.0, ans=0.125 2023-06-24 18:21:07,435 INFO [train.py:996] (3/4) Epoch 7, batch 9800, loss[loss=0.2274, simple_loss=0.2922, pruned_loss=0.08129, over 21807.00 frames. ], tot_loss[loss=0.2228, simple_loss=0.299, pruned_loss=0.07332, over 4277036.66 frames. ], batch size: 414, lr: 4.37e-03, grad_scale: 16.0 2023-06-24 18:21:39,991 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.170e+02 2.762e+02 3.059e+02 4.077e+02 6.018e+02, threshold=6.118e+02, percent-clipped=0.0 2023-06-24 18:22:00,128 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-24 18:22:23,186 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=1156782.0, ans=0.125 2023-06-24 18:22:25,029 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=1156782.0, ans=0.125 2023-06-24 18:22:55,800 INFO [train.py:996] (3/4) Epoch 7, batch 9850, loss[loss=0.2143, simple_loss=0.2969, pruned_loss=0.06583, over 16874.00 frames. ], tot_loss[loss=0.2216, simple_loss=0.2963, pruned_loss=0.07346, over 4278437.73 frames. ], batch size: 62, lr: 4.37e-03, grad_scale: 16.0 2023-06-24 18:23:36,573 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=1157022.0, ans=0.2 2023-06-24 18:23:49,744 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.60 vs. limit=22.5 2023-06-24 18:24:12,424 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1157082.0, ans=0.0 2023-06-24 18:24:38,512 INFO [train.py:996] (3/4) Epoch 7, batch 9900, loss[loss=0.2385, simple_loss=0.3161, pruned_loss=0.08048, over 21468.00 frames. ], tot_loss[loss=0.2206, simple_loss=0.294, pruned_loss=0.07353, over 4267692.07 frames. ], batch size: 211, lr: 4.37e-03, grad_scale: 16.0 2023-06-24 18:24:47,773 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1157202.0, ans=0.1 2023-06-24 18:24:56,960 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1157262.0, ans=0.0 2023-06-24 18:25:11,583 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.53 vs. 
limit=12.0 2023-06-24 18:25:12,375 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.163e+02 2.791e+02 3.369e+02 4.122e+02 6.726e+02, threshold=6.739e+02, percent-clipped=1.0 2023-06-24 18:25:52,238 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1157382.0, ans=0.125 2023-06-24 18:26:27,537 INFO [train.py:996] (3/4) Epoch 7, batch 9950, loss[loss=0.2389, simple_loss=0.2863, pruned_loss=0.0957, over 21256.00 frames. ], tot_loss[loss=0.2241, simple_loss=0.2962, pruned_loss=0.07599, over 4263834.16 frames. ], batch size: 471, lr: 4.37e-03, grad_scale: 16.0 2023-06-24 18:26:31,570 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=1157502.0, ans=0.0 2023-06-24 18:27:05,976 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1157622.0, ans=0.1 2023-06-24 18:27:24,250 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.48 vs. limit=15.0 2023-06-24 18:27:33,480 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.73 vs. limit=22.5 2023-06-24 18:27:51,015 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1157682.0, ans=0.1 2023-06-24 18:28:14,022 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1157742.0, ans=0.125 2023-06-24 18:28:14,081 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1157742.0, ans=0.125 2023-06-24 18:28:15,519 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1157802.0, ans=0.0 2023-06-24 18:28:16,551 INFO [train.py:996] (3/4) Epoch 7, batch 10000, loss[loss=0.1632, simple_loss=0.2392, pruned_loss=0.04357, over 21550.00 frames. ], tot_loss[loss=0.2212, simple_loss=0.2926, pruned_loss=0.07487, over 4264586.88 frames. ], batch size: 230, lr: 4.37e-03, grad_scale: 32.0 2023-06-24 18:28:49,775 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.007e+02 2.643e+02 3.254e+02 4.440e+02 7.063e+02, threshold=6.507e+02, percent-clipped=1.0 2023-06-24 18:29:06,604 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1157922.0, ans=0.0 2023-06-24 18:29:22,800 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.22 vs. limit=15.0 2023-06-24 18:29:35,414 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1157982.0, ans=0.0 2023-06-24 18:29:51,417 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=1158042.0, ans=0.95 2023-06-24 18:29:53,948 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.44 vs. 
limit=10.0 2023-06-24 18:30:04,071 INFO [train.py:996] (3/4) Epoch 7, batch 10050, loss[loss=0.2315, simple_loss=0.2965, pruned_loss=0.08323, over 20742.00 frames. ], tot_loss[loss=0.2222, simple_loss=0.2943, pruned_loss=0.07504, over 4266122.99 frames. ], batch size: 608, lr: 4.37e-03, grad_scale: 32.0 2023-06-24 18:30:39,286 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1158162.0, ans=0.125 2023-06-24 18:31:46,103 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=1158342.0, ans=0.0 2023-06-24 18:32:01,195 INFO [train.py:996] (3/4) Epoch 7, batch 10100, loss[loss=0.2423, simple_loss=0.3061, pruned_loss=0.08925, over 21350.00 frames. ], tot_loss[loss=0.2175, simple_loss=0.2903, pruned_loss=0.07232, over 4274395.42 frames. ], batch size: 159, lr: 4.37e-03, grad_scale: 16.0 2023-06-24 18:32:30,797 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.965e+02 2.650e+02 3.073e+02 3.822e+02 6.259e+02, threshold=6.145e+02, percent-clipped=0.0 2023-06-24 18:33:50,327 INFO [train.py:996] (3/4) Epoch 7, batch 10150, loss[loss=0.2115, simple_loss=0.28, pruned_loss=0.07152, over 21779.00 frames. ], tot_loss[loss=0.2238, simple_loss=0.2974, pruned_loss=0.07509, over 4279607.72 frames. ], batch size: 102, lr: 4.37e-03, grad_scale: 8.0 2023-06-24 18:33:51,224 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=1158702.0, ans=0.2 2023-06-24 18:33:58,320 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=1158702.0, ans=0.5 2023-06-24 18:35:39,219 INFO [train.py:996] (3/4) Epoch 7, batch 10200, loss[loss=0.1972, simple_loss=0.2935, pruned_loss=0.05046, over 21704.00 frames. ], tot_loss[loss=0.2216, simple_loss=0.2965, pruned_loss=0.07334, over 4277740.13 frames. ], batch size: 332, lr: 4.37e-03, grad_scale: 8.0 2023-06-24 18:35:55,972 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.29 vs. limit=15.0 2023-06-24 18:35:57,686 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-24 18:36:17,254 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.114e+02 2.567e+02 2.979e+02 3.564e+02 7.472e+02, threshold=5.959e+02, percent-clipped=1.0 2023-06-24 18:36:24,548 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1159122.0, ans=0.125 2023-06-24 18:36:49,869 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=12.59 vs. limit=22.5 2023-06-24 18:37:19,687 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=7.91 vs. limit=15.0 2023-06-24 18:37:28,862 INFO [train.py:996] (3/4) Epoch 7, batch 10250, loss[loss=0.13, simple_loss=0.2045, pruned_loss=0.02774, over 21145.00 frames. ], tot_loss[loss=0.2132, simple_loss=0.2906, pruned_loss=0.06785, over 4273285.03 frames. 
], batch size: 176, lr: 4.36e-03, grad_scale: 8.0 2023-06-24 18:37:38,334 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=1159302.0, ans=0.0 2023-06-24 18:37:40,465 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=1159302.0, ans=0.2 2023-06-24 18:37:40,950 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.64 vs. limit=22.5 2023-06-24 18:37:45,649 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=1159362.0, ans=0.125 2023-06-24 18:39:17,738 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=8.98 vs. limit=15.0 2023-06-24 18:39:22,101 INFO [train.py:996] (3/4) Epoch 7, batch 10300, loss[loss=0.1884, simple_loss=0.238, pruned_loss=0.06943, over 20917.00 frames. ], tot_loss[loss=0.215, simple_loss=0.2931, pruned_loss=0.06848, over 4267550.44 frames. ], batch size: 608, lr: 4.36e-03, grad_scale: 8.0 2023-06-24 18:39:37,336 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=1159602.0, ans=0.2 2023-06-24 18:39:56,527 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=1159662.0, ans=0.05 2023-06-24 18:40:00,067 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=1159662.0, ans=0.2 2023-06-24 18:40:08,771 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=1159662.0, ans=0.07 2023-06-24 18:40:11,432 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.745e+02 2.687e+02 3.369e+02 4.671e+02 1.084e+03, threshold=6.737e+02, percent-clipped=9.0 2023-06-24 18:40:43,104 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1159782.0, ans=0.125 2023-06-24 18:40:48,172 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=1159782.0, ans=0.0 2023-06-24 18:41:14,516 INFO [train.py:996] (3/4) Epoch 7, batch 10350, loss[loss=0.2195, simple_loss=0.2969, pruned_loss=0.07108, over 21698.00 frames. ], tot_loss[loss=0.217, simple_loss=0.2957, pruned_loss=0.06919, over 4269280.15 frames. ], batch size: 351, lr: 4.36e-03, grad_scale: 8.0 2023-06-24 18:41:32,701 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.42 vs. 
limit=15.0 2023-06-24 18:42:09,195 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten.whitening_limit, batch_count=1160022.0, ans=15.0 2023-06-24 18:42:35,348 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=1160082.0, ans=0.0 2023-06-24 18:42:40,325 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=1160082.0, ans=0.2 2023-06-24 18:42:42,252 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1160082.0, ans=0.0 2023-06-24 18:42:49,580 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1160142.0, ans=0.0 2023-06-24 18:42:58,636 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1160142.0, ans=0.125 2023-06-24 18:43:02,120 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=1160142.0, ans=0.2 2023-06-24 18:43:12,828 INFO [train.py:996] (3/4) Epoch 7, batch 10400, loss[loss=0.1906, simple_loss=0.2635, pruned_loss=0.05885, over 21765.00 frames. ], tot_loss[loss=0.2117, simple_loss=0.288, pruned_loss=0.0677, over 4267995.64 frames. ], batch size: 282, lr: 4.36e-03, grad_scale: 16.0 2023-06-24 18:43:13,806 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1160202.0, ans=0.0 2023-06-24 18:43:21,571 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.70 vs. limit=22.5 2023-06-24 18:43:24,756 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1160202.0, ans=0.1 2023-06-24 18:43:40,442 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1160262.0, ans=0.125 2023-06-24 18:43:56,276 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.199e+02 2.812e+02 3.590e+02 4.501e+02 9.958e+02, threshold=7.181e+02, percent-clipped=6.0 2023-06-24 18:44:12,925 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=1160322.0, ans=0.125 2023-06-24 18:44:40,369 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1160442.0, ans=0.0 2023-06-24 18:45:00,676 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1160442.0, ans=0.0 2023-06-24 18:45:15,893 INFO [train.py:996] (3/4) Epoch 7, batch 10450, loss[loss=0.2139, simple_loss=0.2918, pruned_loss=0.06805, over 21422.00 frames. ], tot_loss[loss=0.2171, simple_loss=0.2936, pruned_loss=0.07031, over 4261510.75 frames. ], batch size: 194, lr: 4.36e-03, grad_scale: 16.0 2023-06-24 18:46:01,716 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1160622.0, ans=0.125 2023-06-24 18:46:35,199 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-24 18:47:06,304 INFO [train.py:996] (3/4) Epoch 7, batch 10500, loss[loss=0.2058, simple_loss=0.2722, pruned_loss=0.0697, over 21632.00 frames. 
], tot_loss[loss=0.2176, simple_loss=0.2951, pruned_loss=0.07007, over 4260012.43 frames. ], batch size: 247, lr: 4.36e-03, grad_scale: 16.0 2023-06-24 18:47:43,181 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.152e+02 2.810e+02 3.423e+02 4.183e+02 6.636e+02, threshold=6.845e+02, percent-clipped=0.0 2023-06-24 18:48:06,364 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-24 18:48:54,918 INFO [train.py:996] (3/4) Epoch 7, batch 10550, loss[loss=0.1826, simple_loss=0.2406, pruned_loss=0.06233, over 21321.00 frames. ], tot_loss[loss=0.2143, simple_loss=0.2897, pruned_loss=0.06947, over 4260349.65 frames. ], batch size: 551, lr: 4.36e-03, grad_scale: 16.0 2023-06-24 18:49:06,790 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.90 vs. limit=22.5 2023-06-24 18:50:01,216 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1161282.0, ans=0.125 2023-06-24 18:50:46,823 INFO [train.py:996] (3/4) Epoch 7, batch 10600, loss[loss=0.1948, simple_loss=0.2533, pruned_loss=0.06817, over 15214.00 frames. ], tot_loss[loss=0.2114, simple_loss=0.2855, pruned_loss=0.06864, over 4248102.24 frames. ], batch size: 62, lr: 4.36e-03, grad_scale: 16.0 2023-06-24 18:50:49,987 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=9.88 vs. limit=15.0 2023-06-24 18:51:06,108 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1161462.0, ans=0.1 2023-06-24 18:51:16,532 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1161462.0, ans=0.125 2023-06-24 18:51:24,992 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.905e+02 2.607e+02 2.934e+02 3.561e+02 5.999e+02, threshold=5.868e+02, percent-clipped=0.0 2023-06-24 18:51:30,899 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1161522.0, ans=0.1 2023-06-24 18:52:13,202 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=1161582.0, ans=0.2 2023-06-24 18:52:38,840 INFO [train.py:996] (3/4) Epoch 7, batch 10650, loss[loss=0.1695, simple_loss=0.2446, pruned_loss=0.04725, over 21308.00 frames. ], tot_loss[loss=0.2094, simple_loss=0.2853, pruned_loss=0.06679, over 4248417.94 frames. ], batch size: 194, lr: 4.36e-03, grad_scale: 16.0 2023-06-24 18:53:05,656 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1161762.0, ans=0.0 2023-06-24 18:53:47,249 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1161882.0, ans=0.125 2023-06-24 18:54:03,183 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=1161882.0, ans=0.0 2023-06-24 18:54:29,873 INFO [train.py:996] (3/4) Epoch 7, batch 10700, loss[loss=0.16, simple_loss=0.233, pruned_loss=0.04346, over 21552.00 frames. ], tot_loss[loss=0.2084, simple_loss=0.284, pruned_loss=0.06637, over 4258700.26 frames. 
], batch size: 230, lr: 4.36e-03, grad_scale: 16.0 2023-06-24 18:54:59,877 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=1162062.0, ans=0.2 2023-06-24 18:55:02,579 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.33 vs. limit=15.0 2023-06-24 18:55:08,596 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.005e+02 2.935e+02 3.419e+02 4.511e+02 9.695e+02, threshold=6.839e+02, percent-clipped=12.0 2023-06-24 18:55:36,767 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=8.16 vs. limit=15.0 2023-06-24 18:56:28,692 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=1162302.0, ans=0.0 2023-06-24 18:56:29,681 INFO [train.py:996] (3/4) Epoch 7, batch 10750, loss[loss=0.256, simple_loss=0.3606, pruned_loss=0.07576, over 21322.00 frames. ], tot_loss[loss=0.2177, simple_loss=0.2946, pruned_loss=0.07036, over 4262500.91 frames. ], batch size: 548, lr: 4.36e-03, grad_scale: 16.0 2023-06-24 18:57:37,831 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1162422.0, ans=0.125 2023-06-24 18:57:38,425 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.95 vs. limit=15.0 2023-06-24 18:58:18,940 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.max_positive, batch_count=1162542.0, ans=0.95 2023-06-24 18:58:21,623 INFO [train.py:996] (3/4) Epoch 7, batch 10800, loss[loss=0.1909, simple_loss=0.2762, pruned_loss=0.05277, over 20720.00 frames. ], tot_loss[loss=0.2223, simple_loss=0.3005, pruned_loss=0.07207, over 4269291.96 frames. ], batch size: 607, lr: 4.36e-03, grad_scale: 32.0 2023-06-24 18:58:26,262 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1162602.0, ans=0.0 2023-06-24 18:59:06,334 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.283e+02 2.815e+02 3.156e+02 3.825e+02 7.344e+02, threshold=6.312e+02, percent-clipped=1.0 2023-06-24 18:59:24,552 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1162782.0, ans=0.0 2023-06-24 18:59:31,801 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1162782.0, ans=0.125 2023-06-24 19:00:01,899 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1162842.0, ans=0.125 2023-06-24 19:00:06,205 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.47 vs. limit=22.5 2023-06-24 19:00:07,152 INFO [train.py:996] (3/4) Epoch 7, batch 10850, loss[loss=0.1899, simple_loss=0.2588, pruned_loss=0.06049, over 21333.00 frames. ], tot_loss[loss=0.2248, simple_loss=0.3036, pruned_loss=0.07302, over 4261575.62 frames. 
], batch size: 211, lr: 4.36e-03, grad_scale: 16.0 2023-06-24 19:00:31,097 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=1162902.0, ans=0.125 2023-06-24 19:00:48,955 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=1162962.0, ans=0.125 2023-06-24 19:01:07,088 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.10 vs. limit=12.0 2023-06-24 19:01:22,441 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1163082.0, ans=0.125 2023-06-24 19:01:24,395 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1163082.0, ans=0.1 2023-06-24 19:01:48,404 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.19 vs. limit=22.5 2023-06-24 19:02:04,082 INFO [train.py:996] (3/4) Epoch 7, batch 10900, loss[loss=0.2178, simple_loss=0.267, pruned_loss=0.08433, over 21414.00 frames. ], tot_loss[loss=0.2185, simple_loss=0.2956, pruned_loss=0.07072, over 4269798.65 frames. ], batch size: 475, lr: 4.36e-03, grad_scale: 16.0 2023-06-24 19:02:47,849 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.963e+02 2.711e+02 3.083e+02 3.861e+02 1.043e+03, threshold=6.166e+02, percent-clipped=5.0 2023-06-24 19:03:14,861 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1163382.0, ans=0.125 2023-06-24 19:03:39,479 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=1163442.0, ans=0.0 2023-06-24 19:03:41,455 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1163442.0, ans=0.1 2023-06-24 19:03:53,383 INFO [train.py:996] (3/4) Epoch 7, batch 10950, loss[loss=0.2182, simple_loss=0.2798, pruned_loss=0.07828, over 21205.00 frames. ], tot_loss[loss=0.2143, simple_loss=0.2914, pruned_loss=0.06864, over 4268568.18 frames. ], batch size: 176, lr: 4.36e-03, grad_scale: 16.0 2023-06-24 19:04:29,878 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1163562.0, ans=0.125 2023-06-24 19:04:42,174 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=1163622.0, ans=0.125 2023-06-24 19:05:10,131 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1163682.0, ans=0.125 2023-06-24 19:05:15,514 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1163682.0, ans=0.125 2023-06-24 19:05:25,534 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1163742.0, ans=0.125 2023-06-24 19:05:40,500 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=11.48 vs. limit=15.0 2023-06-24 19:05:42,596 INFO [train.py:996] (3/4) Epoch 7, batch 11000, loss[loss=0.2246, simple_loss=0.2942, pruned_loss=0.07746, over 21897.00 frames. 
], tot_loss[loss=0.2161, simple_loss=0.2911, pruned_loss=0.07058, over 4280427.89 frames. ], batch size: 351, lr: 4.36e-03, grad_scale: 16.0 2023-06-24 19:05:47,050 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1163802.0, ans=0.125 2023-06-24 19:06:06,321 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=1163862.0, ans=0.0 2023-06-24 19:06:07,144 INFO [scaling.py:962] (3/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=4.21 vs. limit=5.0 2023-06-24 19:06:13,479 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1163862.0, ans=0.125 2023-06-24 19:06:14,834 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1163862.0, ans=0.1 2023-06-24 19:06:14,882 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=1163862.0, ans=0.0 2023-06-24 19:06:16,391 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1163862.0, ans=0.125 2023-06-24 19:06:19,461 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1163862.0, ans=0.125 2023-06-24 19:06:26,185 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.210e+02 2.764e+02 3.110e+02 3.886e+02 6.584e+02, threshold=6.221e+02, percent-clipped=1.0 2023-06-24 19:06:29,283 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.31 vs. limit=10.0 2023-06-24 19:07:08,075 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=11.08 vs. limit=15.0 2023-06-24 19:07:31,768 INFO [train.py:996] (3/4) Epoch 7, batch 11050, loss[loss=0.1837, simple_loss=0.2361, pruned_loss=0.06568, over 21248.00 frames. ], tot_loss[loss=0.2158, simple_loss=0.2889, pruned_loss=0.07131, over 4273140.78 frames. ], batch size: 548, lr: 4.36e-03, grad_scale: 16.0 2023-06-24 19:08:11,678 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1164162.0, ans=0.125 2023-06-24 19:08:20,151 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1164222.0, ans=0.125 2023-06-24 19:08:31,592 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=1164282.0, ans=0.125 2023-06-24 19:08:57,276 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=1164342.0, ans=0.05 2023-06-24 19:09:17,949 INFO [train.py:996] (3/4) Epoch 7, batch 11100, loss[loss=0.2416, simple_loss=0.2909, pruned_loss=0.09613, over 21256.00 frames. ], tot_loss[loss=0.2157, simple_loss=0.2882, pruned_loss=0.07157, over 4279813.01 frames. 
], batch size: 471, lr: 4.36e-03, grad_scale: 16.0 2023-06-24 19:09:24,978 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=1164402.0, ans=0.125 2023-06-24 19:10:00,560 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.194e+02 2.678e+02 3.103e+02 3.561e+02 5.692e+02, threshold=6.205e+02, percent-clipped=0.0 2023-06-24 19:10:27,129 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=1164582.0, ans=0.0 2023-06-24 19:10:48,074 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.29 vs. limit=15.0 2023-06-24 19:10:50,930 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1164642.0, ans=0.125 2023-06-24 19:11:05,123 INFO [train.py:996] (3/4) Epoch 7, batch 11150, loss[loss=0.2002, simple_loss=0.2671, pruned_loss=0.06669, over 21609.00 frames. ], tot_loss[loss=0.2152, simple_loss=0.2865, pruned_loss=0.07196, over 4285695.17 frames. ], batch size: 332, lr: 4.35e-03, grad_scale: 16.0 2023-06-24 19:11:13,369 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.42 vs. limit=6.0 2023-06-24 19:11:48,607 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=1164822.0, ans=0.0 2023-06-24 19:12:52,266 INFO [train.py:996] (3/4) Epoch 7, batch 11200, loss[loss=0.1929, simple_loss=0.2621, pruned_loss=0.06187, over 21698.00 frames. ], tot_loss[loss=0.2136, simple_loss=0.2849, pruned_loss=0.07119, over 4290238.54 frames. ], batch size: 333, lr: 4.35e-03, grad_scale: 32.0 2023-06-24 19:13:24,819 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1165062.0, ans=0.1 2023-06-24 19:13:35,978 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.061e+02 2.570e+02 2.865e+02 3.266e+02 5.455e+02, threshold=5.730e+02, percent-clipped=0.0 2023-06-24 19:13:56,926 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=1165182.0, ans=0.04949747468305833 2023-06-24 19:14:12,895 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=1165182.0, ans=0.2 2023-06-24 19:14:29,652 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1165242.0, ans=0.0 2023-06-24 19:14:40,998 INFO [train.py:996] (3/4) Epoch 7, batch 11250, loss[loss=0.2166, simple_loss=0.2985, pruned_loss=0.06734, over 21182.00 frames. ], tot_loss[loss=0.2133, simple_loss=0.285, pruned_loss=0.07078, over 4287346.46 frames. ], batch size: 143, lr: 4.35e-03, grad_scale: 32.0 2023-06-24 19:15:22,965 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1165362.0, ans=0.125 2023-06-24 19:16:27,688 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1165542.0, ans=0.125 2023-06-24 19:16:31,007 INFO [train.py:996] (3/4) Epoch 7, batch 11300, loss[loss=0.1716, simple_loss=0.2473, pruned_loss=0.04794, over 15931.00 frames. 
], tot_loss[loss=0.2137, simple_loss=0.2857, pruned_loss=0.07088, over 4286824.95 frames. ], batch size: 60, lr: 4.35e-03, grad_scale: 32.0 2023-06-24 19:17:13,081 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1165662.0, ans=0.125 2023-06-24 19:17:13,932 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.273e+02 2.821e+02 3.305e+02 4.579e+02 7.835e+02, threshold=6.611e+02, percent-clipped=6.0 2023-06-24 19:17:48,595 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.06 vs. limit=6.0 2023-06-24 19:17:57,083 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1165842.0, ans=0.125 2023-06-24 19:18:19,899 INFO [train.py:996] (3/4) Epoch 7, batch 11350, loss[loss=0.2322, simple_loss=0.3103, pruned_loss=0.077, over 21615.00 frames. ], tot_loss[loss=0.2135, simple_loss=0.2872, pruned_loss=0.0699, over 4285047.44 frames. ], batch size: 230, lr: 4.35e-03, grad_scale: 32.0 2023-06-24 19:18:20,646 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1165902.0, ans=0.0 2023-06-24 19:18:29,459 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1165902.0, ans=0.0 2023-06-24 19:18:40,898 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=1165902.0, ans=0.0 2023-06-24 19:19:06,948 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1166022.0, ans=0.125 2023-06-24 19:20:11,169 INFO [train.py:996] (3/4) Epoch 7, batch 11400, loss[loss=0.2126, simple_loss=0.2835, pruned_loss=0.07084, over 21385.00 frames. ], tot_loss[loss=0.2192, simple_loss=0.2932, pruned_loss=0.07258, over 4283681.66 frames. ], batch size: 159, lr: 4.35e-03, grad_scale: 16.0 2023-06-24 19:20:54,944 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=1166322.0, ans=0.2 2023-06-24 19:20:56,086 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.109e+02 2.882e+02 3.810e+02 4.991e+02 7.494e+02, threshold=7.619e+02, percent-clipped=6.0 2023-06-24 19:21:25,849 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.60 vs. limit=15.0 2023-06-24 19:21:42,170 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.24 vs. limit=15.0 2023-06-24 19:22:04,951 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1166502.0, ans=0.125 2023-06-24 19:22:06,441 INFO [train.py:996] (3/4) Epoch 7, batch 11450, loss[loss=0.2051, simple_loss=0.2866, pruned_loss=0.06179, over 21463.00 frames. ], tot_loss[loss=0.2172, simple_loss=0.2923, pruned_loss=0.07099, over 4279980.16 frames. ], batch size: 194, lr: 4.35e-03, grad_scale: 16.0 2023-06-24 19:22:16,361 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.48 vs. 
limit=22.5 2023-06-24 19:22:38,412 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=1166562.0, ans=0.0 2023-06-24 19:22:50,781 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1166622.0, ans=0.125 2023-06-24 19:23:29,945 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.31 vs. limit=12.0 2023-06-24 19:23:59,139 INFO [train.py:996] (3/4) Epoch 7, batch 11500, loss[loss=0.2121, simple_loss=0.2971, pruned_loss=0.06356, over 21482.00 frames. ], tot_loss[loss=0.2208, simple_loss=0.2957, pruned_loss=0.07295, over 4275729.57 frames. ], batch size: 194, lr: 4.35e-03, grad_scale: 16.0 2023-06-24 19:24:44,986 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.002e+02 2.827e+02 3.371e+02 4.045e+02 6.932e+02, threshold=6.743e+02, percent-clipped=0.0 2023-06-24 19:24:49,725 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1166922.0, ans=0.1 2023-06-24 19:25:12,011 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=1166982.0, ans=0.2 2023-06-24 19:25:48,081 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1167042.0, ans=0.0 2023-06-24 19:25:56,872 INFO [train.py:996] (3/4) Epoch 7, batch 11550, loss[loss=0.2831, simple_loss=0.3813, pruned_loss=0.09248, over 21671.00 frames. ], tot_loss[loss=0.2238, simple_loss=0.302, pruned_loss=0.0728, over 4279506.13 frames. ], batch size: 414, lr: 4.35e-03, grad_scale: 16.0 2023-06-24 19:27:06,099 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-24 19:27:13,802 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.78 vs. limit=10.0 2023-06-24 19:27:37,098 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1167342.0, ans=0.125 2023-06-24 19:27:46,798 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.54 vs. limit=22.5 2023-06-24 19:27:48,859 INFO [train.py:996] (3/4) Epoch 7, batch 11600, loss[loss=0.2298, simple_loss=0.331, pruned_loss=0.0643, over 21571.00 frames. ], tot_loss[loss=0.2325, simple_loss=0.3159, pruned_loss=0.07455, over 4279748.71 frames. ], batch size: 230, lr: 4.35e-03, grad_scale: 32.0 2023-06-24 19:28:34,720 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.256e+02 2.839e+02 3.611e+02 4.809e+02 8.575e+02, threshold=7.221e+02, percent-clipped=4.0 2023-06-24 19:29:04,918 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-24 19:29:16,939 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=1167582.0, ans=0.125 2023-06-24 19:29:29,019 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=1167642.0, ans=0.0 2023-06-24 19:29:42,813 INFO [train.py:996] (3/4) Epoch 7, batch 11650, loss[loss=0.2351, simple_loss=0.318, pruned_loss=0.07605, over 20707.00 frames. 
], tot_loss[loss=0.2368, simple_loss=0.3229, pruned_loss=0.07536, over 4274885.82 frames. ], batch size: 607, lr: 4.35e-03, grad_scale: 32.0 2023-06-24 19:30:39,079 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1167822.0, ans=0.125 2023-06-24 19:31:02,927 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1167882.0, ans=0.125 2023-06-24 19:31:15,386 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1167942.0, ans=0.0 2023-06-24 19:31:32,533 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1168002.0, ans=0.125 2023-06-24 19:31:33,862 INFO [train.py:996] (3/4) Epoch 7, batch 11700, loss[loss=0.2079, simple_loss=0.2785, pruned_loss=0.06862, over 15780.00 frames. ], tot_loss[loss=0.2316, simple_loss=0.3137, pruned_loss=0.07475, over 4264759.68 frames. ], batch size: 65, lr: 4.35e-03, grad_scale: 32.0 2023-06-24 19:32:16,013 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.211e+02 2.666e+02 3.050e+02 3.571e+02 8.433e+02, threshold=6.100e+02, percent-clipped=2.0 2023-06-24 19:32:38,335 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1168182.0, ans=0.125 2023-06-24 19:33:22,090 INFO [train.py:996] (3/4) Epoch 7, batch 11750, loss[loss=0.2501, simple_loss=0.3189, pruned_loss=0.09061, over 21792.00 frames. ], tot_loss[loss=0.2268, simple_loss=0.3058, pruned_loss=0.07395, over 4253561.83 frames. ], batch size: 118, lr: 4.35e-03, grad_scale: 16.0 2023-06-24 19:33:24,223 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1168302.0, ans=0.125 2023-06-24 19:34:15,755 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1168422.0, ans=0.125 2023-06-24 19:35:04,815 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1168542.0, ans=0.125 2023-06-24 19:35:14,522 INFO [train.py:996] (3/4) Epoch 7, batch 11800, loss[loss=0.2865, simple_loss=0.353, pruned_loss=0.1101, over 21404.00 frames. ], tot_loss[loss=0.229, simple_loss=0.3065, pruned_loss=0.07576, over 4263780.76 frames. ], batch size: 507, lr: 4.35e-03, grad_scale: 16.0 2023-06-24 19:35:20,719 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1168602.0, ans=0.125 2023-06-24 19:36:03,568 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.166e+02 2.959e+02 3.685e+02 4.448e+02 7.783e+02, threshold=7.370e+02, percent-clipped=3.0 2023-06-24 19:36:15,104 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.28 vs. limit=6.0 2023-06-24 19:36:34,513 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.01 vs. 
limit=22.5 2023-06-24 19:37:04,386 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=1168902.0, ans=0.125 2023-06-24 19:37:05,800 INFO [train.py:996] (3/4) Epoch 7, batch 11850, loss[loss=0.2686, simple_loss=0.3944, pruned_loss=0.07141, over 20748.00 frames. ], tot_loss[loss=0.2295, simple_loss=0.3094, pruned_loss=0.07483, over 4265747.80 frames. ], batch size: 607, lr: 4.35e-03, grad_scale: 16.0 2023-06-24 19:37:43,693 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=1168962.0, ans=0.04949747468305833 2023-06-24 19:38:48,450 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1169142.0, ans=0.0 2023-06-24 19:38:50,462 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=1169142.0, ans=0.0 2023-06-24 19:39:02,920 INFO [train.py:996] (3/4) Epoch 7, batch 11900, loss[loss=0.2177, simple_loss=0.2838, pruned_loss=0.0758, over 21423.00 frames. ], tot_loss[loss=0.2271, simple_loss=0.3089, pruned_loss=0.07262, over 4266827.77 frames. ], batch size: 194, lr: 4.35e-03, grad_scale: 16.0 2023-06-24 19:39:26,698 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=1169262.0, ans=0.2 2023-06-24 19:39:41,720 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.65 vs. limit=12.0 2023-06-24 19:39:46,614 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=1169262.0, ans=0.0 2023-06-24 19:39:51,154 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.045e+02 2.709e+02 3.163e+02 3.879e+02 8.042e+02, threshold=6.325e+02, percent-clipped=1.0 2023-06-24 19:40:02,805 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=1169322.0, ans=0.125 2023-06-24 19:40:18,700 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1169382.0, ans=0.0 2023-06-24 19:40:58,224 INFO [train.py:996] (3/4) Epoch 7, batch 11950, loss[loss=0.1799, simple_loss=0.2777, pruned_loss=0.04102, over 21724.00 frames. ], tot_loss[loss=0.2254, simple_loss=0.3106, pruned_loss=0.07013, over 4260037.75 frames. ], batch size: 351, lr: 4.35e-03, grad_scale: 16.0 2023-06-24 19:42:01,362 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=1169682.0, ans=0.125 2023-06-24 19:42:40,540 INFO [train.py:996] (3/4) Epoch 7, batch 12000, loss[loss=0.1781, simple_loss=0.2461, pruned_loss=0.05511, over 21229.00 frames. ], tot_loss[loss=0.2212, simple_loss=0.305, pruned_loss=0.06867, over 4259486.82 frames. ], batch size: 176, lr: 4.34e-03, grad_scale: 32.0 2023-06-24 19:42:40,541 INFO [train.py:1019] (3/4) Computing validation loss 2023-06-24 19:43:01,773 INFO [train.py:1028] (3/4) Epoch 7, validation: loss=0.261, simple_loss=0.3543, pruned_loss=0.08379, over 1796401.00 frames. 
2023-06-24 19:43:01,774 INFO [train.py:1029] (3/4) Maximum memory allocated so far is 23409MB 2023-06-24 19:43:12,176 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=1169802.0, ans=0.0 2023-06-24 19:43:44,288 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.005e+02 2.826e+02 3.232e+02 4.022e+02 5.951e+02, threshold=6.465e+02, percent-clipped=0.0 2023-06-24 19:43:44,925 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1169922.0, ans=0.1 2023-06-24 19:43:56,890 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1169922.0, ans=0.1 2023-06-24 19:44:06,554 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=6.29 vs. limit=12.0 2023-06-24 19:44:56,218 INFO [train.py:996] (3/4) Epoch 7, batch 12050, loss[loss=0.2338, simple_loss=0.301, pruned_loss=0.08332, over 21853.00 frames. ], tot_loss[loss=0.221, simple_loss=0.3005, pruned_loss=0.07075, over 4263634.66 frames. ], batch size: 391, lr: 4.34e-03, grad_scale: 16.0 2023-06-24 19:45:06,041 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=10.45 vs. limit=15.0 2023-06-24 19:45:16,164 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1170162.0, ans=0.125 2023-06-24 19:46:48,303 INFO [train.py:996] (3/4) Epoch 7, batch 12100, loss[loss=0.2694, simple_loss=0.3384, pruned_loss=0.1002, over 21206.00 frames. ], tot_loss[loss=0.2271, simple_loss=0.3045, pruned_loss=0.07483, over 4273090.64 frames. ], batch size: 143, lr: 4.34e-03, grad_scale: 16.0 2023-06-24 19:47:25,778 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.95 vs. limit=15.0 2023-06-24 19:47:32,454 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1170522.0, ans=0.0 2023-06-24 19:47:33,914 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.129e+02 3.018e+02 3.555e+02 4.988e+02 8.352e+02, threshold=7.110e+02, percent-clipped=5.0 2023-06-24 19:48:09,542 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=1170582.0, ans=0.015 2023-06-24 19:48:27,436 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1170642.0, ans=0.0 2023-06-24 19:48:31,069 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=1170642.0, ans=0.0 2023-06-24 19:48:41,163 INFO [train.py:996] (3/4) Epoch 7, batch 12150, loss[loss=0.1985, simple_loss=0.2767, pruned_loss=0.06019, over 21295.00 frames. ], tot_loss[loss=0.2268, simple_loss=0.3061, pruned_loss=0.07373, over 4264806.54 frames. 
], batch size: 176, lr: 4.34e-03, grad_scale: 16.0 2023-06-24 19:48:56,089 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=1170702.0, ans=0.2 2023-06-24 19:49:09,718 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=1170762.0, ans=0.0 2023-06-24 19:49:14,923 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1170762.0, ans=0.2 2023-06-24 19:49:46,176 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.07 vs. limit=22.5 2023-06-24 19:49:52,811 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=1170882.0, ans=0.125 2023-06-24 19:50:10,658 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1170882.0, ans=0.125 2023-06-24 19:50:26,422 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=1170942.0, ans=0.2 2023-06-24 19:50:30,862 INFO [train.py:996] (3/4) Epoch 7, batch 12200, loss[loss=0.1972, simple_loss=0.2557, pruned_loss=0.06934, over 21207.00 frames. ], tot_loss[loss=0.2241, simple_loss=0.3036, pruned_loss=0.07226, over 4262459.59 frames. ], batch size: 159, lr: 4.34e-03, grad_scale: 16.0 2023-06-24 19:50:41,662 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1171002.0, ans=0.0 2023-06-24 19:50:50,875 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.68 vs. limit=6.0 2023-06-24 19:51:22,842 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1171122.0, ans=0.125 2023-06-24 19:51:25,419 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.212e+02 3.032e+02 3.828e+02 4.856e+02 1.056e+03, threshold=7.657e+02, percent-clipped=7.0 2023-06-24 19:51:47,943 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1171182.0, ans=0.1 2023-06-24 19:52:08,695 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1171242.0, ans=0.1 2023-06-24 19:52:18,157 INFO [train.py:996] (3/4) Epoch 7, batch 12250, loss[loss=0.1769, simple_loss=0.2636, pruned_loss=0.04509, over 21686.00 frames. ], tot_loss[loss=0.2169, simple_loss=0.2951, pruned_loss=0.06934, over 4262499.80 frames. ], batch size: 391, lr: 4.34e-03, grad_scale: 16.0 2023-06-24 19:52:22,033 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1171302.0, ans=0.0 2023-06-24 19:53:54,470 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=1171542.0, ans=0.09899494936611666 2023-06-24 19:54:06,964 INFO [train.py:996] (3/4) Epoch 7, batch 12300, loss[loss=0.1578, simple_loss=0.229, pruned_loss=0.04326, over 21268.00 frames. ], tot_loss[loss=0.2074, simple_loss=0.2859, pruned_loss=0.06447, over 4253474.54 frames. 
], batch size: 176, lr: 4.34e-03, grad_scale: 16.0 2023-06-24 19:54:56,065 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.637e+02 2.162e+02 2.543e+02 3.041e+02 6.823e+02, threshold=5.086e+02, percent-clipped=0.0 2023-06-24 19:55:05,200 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1171722.0, ans=0.125 2023-06-24 19:55:48,479 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1171842.0, ans=0.1 2023-06-24 19:55:54,588 INFO [train.py:996] (3/4) Epoch 7, batch 12350, loss[loss=0.2168, simple_loss=0.2978, pruned_loss=0.06792, over 21498.00 frames. ], tot_loss[loss=0.2102, simple_loss=0.29, pruned_loss=0.06525, over 4256003.77 frames. ], batch size: 131, lr: 4.34e-03, grad_scale: 16.0 2023-06-24 19:56:07,771 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=5.15 vs. limit=15.0 2023-06-24 19:56:18,957 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=1171962.0, ans=0.2 2023-06-24 19:56:27,617 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1171962.0, ans=0.1 2023-06-24 19:57:27,798 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=16.15 vs. limit=22.5 2023-06-24 19:57:42,412 INFO [train.py:996] (3/4) Epoch 7, batch 12400, loss[loss=0.2279, simple_loss=0.3052, pruned_loss=0.07533, over 21851.00 frames. ], tot_loss[loss=0.2159, simple_loss=0.2939, pruned_loss=0.06895, over 4268005.48 frames. ], batch size: 414, lr: 4.34e-03, grad_scale: 32.0 2023-06-24 19:57:43,939 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.79 vs. limit=22.5 2023-06-24 19:57:52,443 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.04 vs. limit=22.5 2023-06-24 19:58:37,883 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.337e+02 2.786e+02 3.157e+02 3.873e+02 7.298e+02, threshold=6.314e+02, percent-clipped=10.0 2023-06-24 19:58:46,018 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1172322.0, ans=0.125 2023-06-24 19:59:33,082 INFO [train.py:996] (3/4) Epoch 7, batch 12450, loss[loss=0.2755, simple_loss=0.3616, pruned_loss=0.09466, over 21811.00 frames. ], tot_loss[loss=0.2208, simple_loss=0.2978, pruned_loss=0.07186, over 4271493.24 frames. ], batch size: 124, lr: 4.34e-03, grad_scale: 32.0 2023-06-24 19:59:51,603 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1172502.0, ans=0.125 2023-06-24 20:00:04,788 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1172562.0, ans=0.0 2023-06-24 20:00:07,226 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.49 vs. 
limit=15.0 2023-06-24 20:00:08,806 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.52 vs. limit=15.0 2023-06-24 20:01:28,803 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1172802.0, ans=0.0 2023-06-24 20:01:30,060 INFO [train.py:996] (3/4) Epoch 7, batch 12500, loss[loss=0.2122, simple_loss=0.3462, pruned_loss=0.03913, over 19783.00 frames. ], tot_loss[loss=0.2285, simple_loss=0.3087, pruned_loss=0.07411, over 4273591.79 frames. ], batch size: 702, lr: 4.34e-03, grad_scale: 32.0 2023-06-24 20:01:51,754 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1172802.0, ans=0.125 2023-06-24 20:02:17,861 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=1172862.0, ans=0.0 2023-06-24 20:02:24,572 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.454e+02 3.093e+02 3.470e+02 4.423e+02 7.018e+02, threshold=6.940e+02, percent-clipped=1.0 2023-06-24 20:03:04,367 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-24 20:03:31,025 INFO [train.py:996] (3/4) Epoch 7, batch 12550, loss[loss=0.2179, simple_loss=0.3086, pruned_loss=0.06363, over 21729.00 frames. ], tot_loss[loss=0.2331, simple_loss=0.3128, pruned_loss=0.07676, over 4276136.39 frames. ], batch size: 332, lr: 4.34e-03, grad_scale: 32.0 2023-06-24 20:03:54,540 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1173162.0, ans=0.125 2023-06-24 20:04:00,062 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1173162.0, ans=0.125 2023-06-24 20:04:21,162 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=9.60 vs. limit=10.0 2023-06-24 20:05:13,073 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1173342.0, ans=0.125 2023-06-24 20:05:21,009 INFO [train.py:996] (3/4) Epoch 7, batch 12600, loss[loss=0.207, simple_loss=0.2961, pruned_loss=0.05898, over 21697.00 frames. ], tot_loss[loss=0.2304, simple_loss=0.3115, pruned_loss=0.0746, over 4275600.93 frames. ], batch size: 351, lr: 4.34e-03, grad_scale: 32.0 2023-06-24 20:05:32,470 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.23 vs. limit=15.0 2023-06-24 20:05:37,739 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.12 vs. limit=10.0 2023-06-24 20:06:05,564 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.874e+02 2.821e+02 3.460e+02 4.531e+02 8.641e+02, threshold=6.920e+02, percent-clipped=2.0 2023-06-24 20:06:25,639 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=12.72 vs. 
limit=15.0 2023-06-24 20:06:30,678 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=1173582.0, ans=0.95 2023-06-24 20:07:00,776 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1173642.0, ans=0.125 2023-06-24 20:07:13,634 INFO [train.py:996] (3/4) Epoch 7, batch 12650, loss[loss=0.2357, simple_loss=0.2969, pruned_loss=0.08727, over 21282.00 frames. ], tot_loss[loss=0.2244, simple_loss=0.3055, pruned_loss=0.07163, over 4277391.04 frames. ], batch size: 143, lr: 4.34e-03, grad_scale: 16.0 2023-06-24 20:07:35,502 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=1173762.0, ans=0.125 2023-06-24 20:08:56,935 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1173942.0, ans=0.1 2023-06-24 20:09:00,840 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=1174002.0, ans=0.2 2023-06-24 20:09:02,167 INFO [train.py:996] (3/4) Epoch 7, batch 12700, loss[loss=0.2418, simple_loss=0.3123, pruned_loss=0.08564, over 21744.00 frames. ], tot_loss[loss=0.2257, simple_loss=0.304, pruned_loss=0.0737, over 4286678.68 frames. ], batch size: 332, lr: 4.34e-03, grad_scale: 16.0 2023-06-24 20:09:06,359 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1174002.0, ans=0.1 2023-06-24 20:09:18,043 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.92 vs. limit=15.0 2023-06-24 20:09:37,959 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1174062.0, ans=0.125 2023-06-24 20:09:47,814 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.268e+02 2.796e+02 3.277e+02 3.938e+02 5.852e+02, threshold=6.553e+02, percent-clipped=0.0 2023-06-24 20:10:50,766 INFO [train.py:996] (3/4) Epoch 7, batch 12750, loss[loss=0.2065, simple_loss=0.2999, pruned_loss=0.05657, over 21644.00 frames. ], tot_loss[loss=0.2272, simple_loss=0.3056, pruned_loss=0.07435, over 4289461.16 frames. ], batch size: 263, lr: 4.34e-03, grad_scale: 16.0 2023-06-24 20:10:51,478 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=1174302.0, ans=0.125 2023-06-24 20:12:30,884 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1174542.0, ans=0.125 2023-06-24 20:12:39,034 INFO [train.py:996] (3/4) Epoch 7, batch 12800, loss[loss=0.2328, simple_loss=0.3018, pruned_loss=0.0819, over 20073.00 frames. ], tot_loss[loss=0.2276, simple_loss=0.306, pruned_loss=0.07459, over 4284392.17 frames. ], batch size: 702, lr: 4.34e-03, grad_scale: 32.0 2023-06-24 20:12:48,473 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1174602.0, ans=0.125 2023-06-24 20:12:56,114 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.30 vs. 
limit=6.0 2023-06-24 20:13:03,365 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.05 vs. limit=15.0 2023-06-24 20:13:25,312 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.095e+02 2.978e+02 3.549e+02 4.677e+02 8.571e+02, threshold=7.098e+02, percent-clipped=5.0 2023-06-24 20:13:28,502 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=16.53 vs. limit=22.5 2023-06-24 20:14:06,117 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.68 vs. limit=12.0 2023-06-24 20:14:06,118 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.whiten.whitening_limit, batch_count=1174842.0, ans=12.0 2023-06-24 20:14:15,010 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1174842.0, ans=0.0 2023-06-24 20:14:22,223 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=1174842.0, ans=0.125 2023-06-24 20:14:25,227 INFO [train.py:996] (3/4) Epoch 7, batch 12850, loss[loss=0.1856, simple_loss=0.2816, pruned_loss=0.04479, over 21749.00 frames. ], tot_loss[loss=0.2294, simple_loss=0.3078, pruned_loss=0.07547, over 4282227.49 frames. ], batch size: 282, lr: 4.34e-03, grad_scale: 32.0 2023-06-24 20:14:30,800 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1174902.0, ans=0.1 2023-06-24 20:15:50,646 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=1175142.0, ans=0.0 2023-06-24 20:16:05,148 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=1175142.0, ans=0.07 2023-06-24 20:16:16,211 INFO [train.py:996] (3/4) Epoch 7, batch 12900, loss[loss=0.2287, simple_loss=0.3184, pruned_loss=0.06952, over 21652.00 frames. ], tot_loss[loss=0.2256, simple_loss=0.306, pruned_loss=0.07263, over 4276131.92 frames. ], batch size: 389, lr: 4.34e-03, grad_scale: 16.0 2023-06-24 20:16:45,573 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.00 vs. limit=15.0 2023-06-24 20:17:09,057 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.09 vs. limit=15.0 2023-06-24 20:17:14,884 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.015e+02 2.556e+02 2.922e+02 3.625e+02 8.221e+02, threshold=5.845e+02, percent-clipped=4.0 2023-06-24 20:17:57,864 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=3.83 vs. limit=12.0 2023-06-24 20:18:05,564 INFO [train.py:996] (3/4) Epoch 7, batch 12950, loss[loss=0.2028, simple_loss=0.2805, pruned_loss=0.0625, over 21414.00 frames. ], tot_loss[loss=0.2231, simple_loss=0.3041, pruned_loss=0.0711, over 4279598.91 frames. 
], batch size: 194, lr: 4.33e-03, grad_scale: 16.0 2023-06-24 20:18:32,275 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1175562.0, ans=0.125 2023-06-24 20:19:53,387 INFO [train.py:996] (3/4) Epoch 7, batch 13000, loss[loss=0.1795, simple_loss=0.2649, pruned_loss=0.04709, over 21732.00 frames. ], tot_loss[loss=0.2226, simple_loss=0.3036, pruned_loss=0.07081, over 4276173.41 frames. ], batch size: 298, lr: 4.33e-03, grad_scale: 16.0 2023-06-24 20:20:50,808 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.916e+02 2.748e+02 3.242e+02 4.275e+02 7.846e+02, threshold=6.485e+02, percent-clipped=8.0 2023-06-24 20:21:20,083 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=9.10 vs. limit=10.0 2023-06-24 20:21:21,428 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1176042.0, ans=0.125 2023-06-24 20:21:41,936 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1176102.0, ans=0.125 2023-06-24 20:21:43,441 INFO [train.py:996] (3/4) Epoch 7, batch 13050, loss[loss=0.21, simple_loss=0.2805, pruned_loss=0.06979, over 21343.00 frames. ], tot_loss[loss=0.2184, simple_loss=0.2988, pruned_loss=0.06899, over 4284213.23 frames. ], batch size: 144, lr: 4.33e-03, grad_scale: 16.0 2023-06-24 20:22:49,048 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1176222.0, ans=0.125 2023-06-24 20:22:59,973 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1176282.0, ans=0.125 2023-06-24 20:23:03,915 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.39 vs. limit=12.0 2023-06-24 20:23:31,283 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=1176402.0, ans=0.0 2023-06-24 20:23:32,048 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=8.30 vs. limit=10.0 2023-06-24 20:23:32,681 INFO [train.py:996] (3/4) Epoch 7, batch 13100, loss[loss=0.2061, simple_loss=0.2899, pruned_loss=0.06112, over 21746.00 frames. ], tot_loss[loss=0.219, simple_loss=0.3, pruned_loss=0.06902, over 4291801.48 frames. ], batch size: 247, lr: 4.33e-03, grad_scale: 16.0 2023-06-24 20:23:44,025 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=1176402.0, ans=0.2 2023-06-24 20:24:28,530 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=1176522.0, ans=0.125 2023-06-24 20:24:31,400 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.070e+02 2.745e+02 3.057e+02 3.676e+02 6.184e+02, threshold=6.113e+02, percent-clipped=0.0 2023-06-24 20:24:59,766 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1176582.0, ans=0.125 2023-06-24 20:25:33,935 INFO [train.py:996] (3/4) Epoch 7, batch 13150, loss[loss=0.2699, simple_loss=0.3357, pruned_loss=0.102, over 21454.00 frames. 
], tot_loss[loss=0.2234, simple_loss=0.3035, pruned_loss=0.07162, over 4292835.09 frames. ], batch size: 471, lr: 4.33e-03, grad_scale: 16.0 2023-06-24 20:25:55,810 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.06 vs. limit=15.0 2023-06-24 20:25:58,835 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=1176762.0, ans=0.0 2023-06-24 20:26:22,377 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=1176822.0, ans=0.09899494936611666 2023-06-24 20:26:56,752 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.32 vs. limit=15.0 2023-06-24 20:27:28,471 INFO [train.py:996] (3/4) Epoch 7, batch 13200, loss[loss=0.2352, simple_loss=0.3157, pruned_loss=0.07731, over 21825.00 frames. ], tot_loss[loss=0.2231, simple_loss=0.3026, pruned_loss=0.07174, over 4292645.87 frames. ], batch size: 124, lr: 4.33e-03, grad_scale: 32.0 2023-06-24 20:27:48,725 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1177062.0, ans=0.125 2023-06-24 20:28:17,673 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.491e+02 2.990e+02 3.679e+02 4.765e+02 8.248e+02, threshold=7.359e+02, percent-clipped=11.0 2023-06-24 20:29:08,673 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=9.07 vs. limit=15.0 2023-06-24 20:29:13,135 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=1177242.0, ans=0.2 2023-06-24 20:29:18,287 INFO [train.py:996] (3/4) Epoch 7, batch 13250, loss[loss=0.2129, simple_loss=0.2958, pruned_loss=0.06502, over 21472.00 frames. ], tot_loss[loss=0.226, simple_loss=0.3031, pruned_loss=0.0745, over 4294762.62 frames. ], batch size: 211, lr: 4.33e-03, grad_scale: 16.0 2023-06-24 20:29:18,821 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1177302.0, ans=0.125 2023-06-24 20:29:45,412 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=1177362.0, ans=0.2 2023-06-24 20:29:57,415 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=7.16 vs. limit=15.0 2023-06-24 20:29:59,299 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=8.27 vs. limit=15.0 2023-06-24 20:30:00,638 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=1177422.0, ans=0.125 2023-06-24 20:30:32,994 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1177482.0, ans=0.0 2023-06-24 20:31:09,756 INFO [train.py:996] (3/4) Epoch 7, batch 13300, loss[loss=0.2672, simple_loss=0.3425, pruned_loss=0.09594, over 21865.00 frames. ], tot_loss[loss=0.2287, simple_loss=0.3073, pruned_loss=0.07503, over 4279965.11 frames. 
], batch size: 118, lr: 4.33e-03, grad_scale: 16.0 2023-06-24 20:31:13,936 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1177602.0, ans=0.125 2023-06-24 20:31:45,492 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1177662.0, ans=0.0 2023-06-24 20:32:02,141 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1177722.0, ans=0.0 2023-06-24 20:32:10,684 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.218e+02 2.867e+02 3.500e+02 4.353e+02 7.353e+02, threshold=7.001e+02, percent-clipped=0.0 2023-06-24 20:32:11,925 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.60 vs. limit=15.0 2023-06-24 20:32:44,045 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.55 vs. limit=6.0 2023-06-24 20:32:57,625 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1177842.0, ans=0.1 2023-06-24 20:33:00,259 INFO [train.py:996] (3/4) Epoch 7, batch 13350, loss[loss=0.207, simple_loss=0.295, pruned_loss=0.05953, over 20808.00 frames. ], tot_loss[loss=0.2332, simple_loss=0.3115, pruned_loss=0.07747, over 4278853.66 frames. ], batch size: 607, lr: 4.33e-03, grad_scale: 16.0 2023-06-24 20:33:01,520 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.88 vs. limit=15.0 2023-06-24 20:34:08,806 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=1178022.0, ans=0.2 2023-06-24 20:34:44,214 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=1178142.0, ans=0.0 2023-06-24 20:34:44,312 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=1178142.0, ans=0.0 2023-06-24 20:34:48,885 INFO [train.py:996] (3/4) Epoch 7, batch 13400, loss[loss=0.2416, simple_loss=0.3122, pruned_loss=0.08544, over 21794.00 frames. ], tot_loss[loss=0.2361, simple_loss=0.313, pruned_loss=0.07961, over 4280523.33 frames. ], batch size: 414, lr: 4.33e-03, grad_scale: 16.0 2023-06-24 20:35:02,325 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1178202.0, ans=0.125 2023-06-24 20:35:27,141 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=1178262.0, ans=0.2 2023-06-24 20:35:54,570 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.196e+02 2.870e+02 3.236e+02 3.893e+02 7.079e+02, threshold=6.472e+02, percent-clipped=1.0 2023-06-24 20:35:55,222 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=1178322.0, ans=0.04949747468305833 2023-06-24 20:36:01,551 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.54 vs. 
limit=15.0 2023-06-24 20:36:17,656 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=1178382.0, ans=0.1 2023-06-24 20:36:29,905 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=1178442.0, ans=0.125 2023-06-24 20:36:43,528 INFO [train.py:996] (3/4) Epoch 7, batch 13450, loss[loss=0.2319, simple_loss=0.3071, pruned_loss=0.07841, over 21459.00 frames. ], tot_loss[loss=0.2379, simple_loss=0.3135, pruned_loss=0.08118, over 4283491.69 frames. ], batch size: 131, lr: 4.33e-03, grad_scale: 16.0 2023-06-24 20:36:56,761 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.76 vs. limit=15.0 2023-06-24 20:37:23,701 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1178562.0, ans=0.0 2023-06-24 20:38:16,035 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1178742.0, ans=0.0 2023-06-24 20:38:28,302 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1178742.0, ans=0.125 2023-06-24 20:38:33,344 INFO [train.py:996] (3/4) Epoch 7, batch 13500, loss[loss=0.2201, simple_loss=0.2918, pruned_loss=0.07424, over 21305.00 frames. ], tot_loss[loss=0.23, simple_loss=0.3041, pruned_loss=0.07799, over 4281942.82 frames. ], batch size: 159, lr: 4.33e-03, grad_scale: 16.0 2023-06-24 20:38:52,251 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=1178802.0, ans=0.0 2023-06-24 20:39:04,046 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.67 vs. limit=15.0 2023-06-24 20:39:35,907 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.421e+02 3.362e+02 3.847e+02 4.790e+02 7.815e+02, threshold=7.695e+02, percent-clipped=4.0 2023-06-24 20:40:30,487 INFO [train.py:996] (3/4) Epoch 7, batch 13550, loss[loss=0.2788, simple_loss=0.3779, pruned_loss=0.08982, over 21680.00 frames. ], tot_loss[loss=0.2316, simple_loss=0.3088, pruned_loss=0.07719, over 4281084.24 frames. ], batch size: 414, lr: 4.33e-03, grad_scale: 16.0 2023-06-24 20:41:09,484 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=1179162.0, ans=0.125 2023-06-24 20:41:12,860 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1179162.0, ans=0.125 2023-06-24 20:42:19,504 INFO [train.py:996] (3/4) Epoch 7, batch 13600, loss[loss=0.1947, simple_loss=0.2701, pruned_loss=0.05968, over 21483.00 frames. ], tot_loss[loss=0.2316, simple_loss=0.309, pruned_loss=0.07707, over 4280659.22 frames. ], batch size: 194, lr: 4.33e-03, grad_scale: 32.0 2023-06-24 20:43:08,230 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.20 vs. 
limit=15.0 2023-06-24 20:43:13,905 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.974e+02 2.752e+02 3.319e+02 4.170e+02 8.424e+02, threshold=6.637e+02, percent-clipped=2.0 2023-06-24 20:43:21,849 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1179582.0, ans=0.0 2023-06-24 20:43:24,937 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1179582.0, ans=0.125 2023-06-24 20:44:13,936 INFO [train.py:996] (3/4) Epoch 7, batch 13650, loss[loss=0.2148, simple_loss=0.2792, pruned_loss=0.07517, over 21568.00 frames. ], tot_loss[loss=0.2249, simple_loss=0.3033, pruned_loss=0.0733, over 4281109.88 frames. ], batch size: 414, lr: 4.33e-03, grad_scale: 32.0 2023-06-24 20:44:16,741 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.73 vs. limit=12.0 2023-06-24 20:44:21,966 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten.whitening_limit, batch_count=1179702.0, ans=15.0 2023-06-24 20:44:41,760 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1179762.0, ans=0.125 2023-06-24 20:44:59,140 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=1179822.0, ans=0.0 2023-06-24 20:45:08,275 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=1179822.0, ans=0.125 2023-06-24 20:46:02,769 INFO [train.py:996] (3/4) Epoch 7, batch 13700, loss[loss=0.1859, simple_loss=0.2423, pruned_loss=0.06474, over 21756.00 frames. ], tot_loss[loss=0.2221, simple_loss=0.2978, pruned_loss=0.07324, over 4266876.30 frames. ], batch size: 124, lr: 4.33e-03, grad_scale: 16.0 2023-06-24 20:46:26,135 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1180062.0, ans=0.125 2023-06-24 20:46:53,962 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.072e+02 2.933e+02 3.408e+02 4.386e+02 8.480e+02, threshold=6.816e+02, percent-clipped=3.0 2023-06-24 20:47:55,836 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=1180242.0, ans=0.07 2023-06-24 20:47:58,386 INFO [train.py:996] (3/4) Epoch 7, batch 13750, loss[loss=0.2, simple_loss=0.2729, pruned_loss=0.06355, over 21595.00 frames. ], tot_loss[loss=0.2196, simple_loss=0.2952, pruned_loss=0.07197, over 4271913.75 frames. 
], batch size: 230, lr: 4.33e-03, grad_scale: 16.0 2023-06-24 20:48:22,938 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=1180362.0, ans=0.2 2023-06-24 20:48:46,934 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.min_positive, batch_count=1180422.0, ans=0.05 2023-06-24 20:48:56,670 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1180422.0, ans=0.125 2023-06-24 20:49:20,361 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1180482.0, ans=0.0 2023-06-24 20:49:51,681 INFO [train.py:996] (3/4) Epoch 7, batch 13800, loss[loss=0.2355, simple_loss=0.3437, pruned_loss=0.06367, over 21784.00 frames. ], tot_loss[loss=0.2202, simple_loss=0.2996, pruned_loss=0.07044, over 4273795.27 frames. ], batch size: 332, lr: 4.33e-03, grad_scale: 16.0 2023-06-24 20:50:12,951 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.30 vs. limit=6.0 2023-06-24 20:50:46,292 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1180722.0, ans=0.125 2023-06-24 20:50:55,000 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.264e+02 2.819e+02 3.661e+02 5.277e+02 1.106e+03, threshold=7.321e+02, percent-clipped=8.0 2023-06-24 20:51:01,080 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1180782.0, ans=0.0 2023-06-24 20:51:42,290 INFO [train.py:996] (3/4) Epoch 7, batch 13850, loss[loss=0.2758, simple_loss=0.352, pruned_loss=0.09982, over 21324.00 frames. ], tot_loss[loss=0.2252, simple_loss=0.3059, pruned_loss=0.07227, over 4271125.93 frames. ], batch size: 548, lr: 4.32e-03, grad_scale: 16.0 2023-06-24 20:52:27,301 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1181022.0, ans=0.0 2023-06-24 20:52:58,607 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1181082.0, ans=0.0 2023-06-24 20:53:33,249 INFO [train.py:996] (3/4) Epoch 7, batch 13900, loss[loss=0.2469, simple_loss=0.3153, pruned_loss=0.08919, over 21804.00 frames. ], tot_loss[loss=0.2299, simple_loss=0.3089, pruned_loss=0.07544, over 4272920.70 frames. ], batch size: 351, lr: 4.32e-03, grad_scale: 16.0 2023-06-24 20:54:22,875 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1181322.0, ans=0.0 2023-06-24 20:54:34,860 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.286e+02 3.148e+02 3.792e+02 4.891e+02 9.530e+02, threshold=7.583e+02, percent-clipped=4.0 2023-06-24 20:54:45,342 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=1181382.0, ans=0.125 2023-06-24 20:54:57,770 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1181382.0, ans=0.1 2023-06-24 20:55:22,185 INFO [train.py:996] (3/4) Epoch 7, batch 13950, loss[loss=0.2483, simple_loss=0.3263, pruned_loss=0.08513, over 21850.00 frames. ], tot_loss[loss=0.2315, simple_loss=0.3086, pruned_loss=0.07719, over 4285384.46 frames. 
], batch size: 414, lr: 4.32e-03, grad_scale: 16.0 2023-06-24 20:55:55,791 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1181562.0, ans=0.1 2023-06-24 20:56:16,813 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1181622.0, ans=0.0 2023-06-24 20:56:39,607 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.37 vs. limit=15.0 2023-06-24 20:56:40,382 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=1181682.0, ans=0.125 2023-06-24 20:56:52,460 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=1181742.0, ans=0.2 2023-06-24 20:57:09,133 INFO [train.py:996] (3/4) Epoch 7, batch 14000, loss[loss=0.2177, simple_loss=0.2832, pruned_loss=0.07608, over 20065.00 frames. ], tot_loss[loss=0.2288, simple_loss=0.3068, pruned_loss=0.0754, over 4280750.67 frames. ], batch size: 702, lr: 4.32e-03, grad_scale: 32.0 2023-06-24 20:57:17,675 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.53 vs. limit=15.0 2023-06-24 20:57:42,305 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1181862.0, ans=0.0 2023-06-24 20:57:46,300 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.56 vs. limit=15.0 2023-06-24 20:58:14,895 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.195e+02 2.939e+02 3.299e+02 3.866e+02 1.368e+03, threshold=6.598e+02, percent-clipped=4.0 2023-06-24 20:58:20,669 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=1181982.0, ans=0.125 2023-06-24 20:58:56,590 INFO [train.py:996] (3/4) Epoch 7, batch 14050, loss[loss=0.183, simple_loss=0.2834, pruned_loss=0.04133, over 21748.00 frames. ], tot_loss[loss=0.2218, simple_loss=0.3012, pruned_loss=0.07116, over 4282574.72 frames. ], batch size: 298, lr: 4.32e-03, grad_scale: 32.0 2023-06-24 20:59:32,783 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.32 vs. limit=22.5 2023-06-24 20:59:46,011 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1182222.0, ans=0.1 2023-06-24 21:00:01,651 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1182222.0, ans=0.0 2023-06-24 21:00:26,014 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=1182342.0, ans=0.2 2023-06-24 21:00:44,818 INFO [train.py:996] (3/4) Epoch 7, batch 14100, loss[loss=0.2334, simple_loss=0.2992, pruned_loss=0.08376, over 21332.00 frames. ], tot_loss[loss=0.218, simple_loss=0.2951, pruned_loss=0.07052, over 4283306.72 frames. 
], batch size: 131, lr: 4.32e-03, grad_scale: 32.0 2023-06-24 21:01:01,255 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-24 21:01:25,916 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1182462.0, ans=0.125 2023-06-24 21:01:52,073 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.852e+02 2.658e+02 3.185e+02 3.657e+02 7.559e+02, threshold=6.369e+02, percent-clipped=1.0 2023-06-24 21:02:01,321 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=1182582.0, ans=0.2 2023-06-24 21:02:16,029 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1182642.0, ans=0.125 2023-06-24 21:02:22,244 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1182642.0, ans=0.125 2023-06-24 21:02:29,762 INFO [train.py:996] (3/4) Epoch 7, batch 14150, loss[loss=0.215, simple_loss=0.3056, pruned_loss=0.06222, over 21630.00 frames. ], tot_loss[loss=0.2213, simple_loss=0.2988, pruned_loss=0.07187, over 4283985.57 frames. ], batch size: 263, lr: 4.32e-03, grad_scale: 16.0 2023-06-24 21:03:58,442 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=1182942.0, ans=0.2 2023-06-24 21:04:14,271 INFO [train.py:996] (3/4) Epoch 7, batch 14200, loss[loss=0.2068, simple_loss=0.2814, pruned_loss=0.06605, over 21374.00 frames. ], tot_loss[loss=0.2187, simple_loss=0.2976, pruned_loss=0.06989, over 4277910.59 frames. ], batch size: 194, lr: 4.32e-03, grad_scale: 16.0 2023-06-24 21:05:00,782 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1183122.0, ans=0.1 2023-06-24 21:05:14,919 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1183122.0, ans=0.125 2023-06-24 21:05:17,833 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.744e+02 2.638e+02 3.052e+02 3.885e+02 7.622e+02, threshold=6.105e+02, percent-clipped=2.0 2023-06-24 21:06:03,293 INFO [train.py:996] (3/4) Epoch 7, batch 14250, loss[loss=0.2245, simple_loss=0.2817, pruned_loss=0.08362, over 20600.00 frames. ], tot_loss[loss=0.2168, simple_loss=0.293, pruned_loss=0.07033, over 4267074.29 frames. ], batch size: 607, lr: 4.32e-03, grad_scale: 8.0 2023-06-24 21:07:19,215 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=1183482.0, ans=0.025 2023-06-24 21:07:23,026 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1183482.0, ans=0.125 2023-06-24 21:07:26,480 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1183482.0, ans=0.125 2023-06-24 21:07:28,266 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=1183482.0, ans=0.125 2023-06-24 21:07:42,720 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.25 vs. 
limit=6.0 2023-06-24 21:07:52,511 INFO [train.py:996] (3/4) Epoch 7, batch 14300, loss[loss=0.237, simple_loss=0.3271, pruned_loss=0.07341, over 21676.00 frames. ], tot_loss[loss=0.2186, simple_loss=0.2955, pruned_loss=0.07086, over 4266481.28 frames. ], batch size: 247, lr: 4.32e-03, grad_scale: 8.0 2023-06-24 21:08:30,436 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=1183722.0, ans=0.0 2023-06-24 21:08:56,719 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.111e+02 2.824e+02 3.398e+02 4.914e+02 1.429e+03, threshold=6.796e+02, percent-clipped=17.0 2023-06-24 21:09:10,736 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.59 vs. limit=15.0 2023-06-24 21:09:40,479 INFO [train.py:996] (3/4) Epoch 7, batch 14350, loss[loss=0.2056, simple_loss=0.2888, pruned_loss=0.06124, over 21846.00 frames. ], tot_loss[loss=0.2223, simple_loss=0.3012, pruned_loss=0.07168, over 4272562.89 frames. ], batch size: 351, lr: 4.32e-03, grad_scale: 8.0 2023-06-24 21:10:47,644 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1184022.0, ans=0.0 2023-06-24 21:11:28,469 INFO [train.py:996] (3/4) Epoch 7, batch 14400, loss[loss=0.2032, simple_loss=0.2708, pruned_loss=0.06778, over 21429.00 frames. ], tot_loss[loss=0.2206, simple_loss=0.2978, pruned_loss=0.07176, over 4269335.48 frames. ], batch size: 212, lr: 4.32e-03, grad_scale: 16.0 2023-06-24 21:11:34,797 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1184202.0, ans=0.125 2023-06-24 21:11:36,460 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=1184202.0, ans=0.0 2023-06-24 21:11:45,202 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=8.37 vs. limit=10.0 2023-06-24 21:12:09,528 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-24 21:12:29,576 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-24 21:12:32,415 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.148e+02 2.835e+02 3.374e+02 4.163e+02 7.231e+02, threshold=6.749e+02, percent-clipped=2.0 2023-06-24 21:13:03,174 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=1184442.0, ans=0.1 2023-06-24 21:13:14,855 INFO [train.py:996] (3/4) Epoch 7, batch 14450, loss[loss=0.2469, simple_loss=0.2895, pruned_loss=0.1021, over 21593.00 frames. ], tot_loss[loss=0.2178, simple_loss=0.2921, pruned_loss=0.07173, over 4265754.72 frames. ], batch size: 508, lr: 4.32e-03, grad_scale: 16.0 2023-06-24 21:14:29,900 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1184682.0, ans=0.0 2023-06-24 21:15:03,221 INFO [train.py:996] (3/4) Epoch 7, batch 14500, loss[loss=0.2339, simple_loss=0.3622, pruned_loss=0.05278, over 20813.00 frames. ], tot_loss[loss=0.2163, simple_loss=0.2895, pruned_loss=0.07151, over 4268427.98 frames. 
], batch size: 607, lr: 4.32e-03, grad_scale: 16.0 2023-06-24 21:16:08,433 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.052e+02 2.747e+02 3.186e+02 4.190e+02 7.871e+02, threshold=6.373e+02, percent-clipped=3.0 2023-06-24 21:16:41,084 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=1185042.0, ans=0.0 2023-06-24 21:16:41,142 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=1185042.0, ans=0.125 2023-06-24 21:16:47,454 INFO [train.py:996] (3/4) Epoch 7, batch 14550, loss[loss=0.2662, simple_loss=0.3366, pruned_loss=0.09786, over 21689.00 frames. ], tot_loss[loss=0.2201, simple_loss=0.2948, pruned_loss=0.07272, over 4268853.29 frames. ], batch size: 351, lr: 4.32e-03, grad_scale: 16.0 2023-06-24 21:18:20,735 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=1185342.0, ans=0.0 2023-06-24 21:18:37,535 INFO [train.py:996] (3/4) Epoch 7, batch 14600, loss[loss=0.2147, simple_loss=0.3041, pruned_loss=0.06263, over 21321.00 frames. ], tot_loss[loss=0.2269, simple_loss=0.3018, pruned_loss=0.07597, over 4263558.06 frames. ], batch size: 176, lr: 4.32e-03, grad_scale: 16.0 2023-06-24 21:19:42,435 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.336e+02 3.108e+02 3.903e+02 5.552e+02 1.166e+03, threshold=7.806e+02, percent-clipped=17.0 2023-06-24 21:19:44,691 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1185582.0, ans=0.0 2023-06-24 21:19:48,769 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=5.18 vs. limit=15.0 2023-06-24 21:20:04,407 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1185642.0, ans=0.1 2023-06-24 21:20:05,800 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-24 21:20:20,968 INFO [train.py:996] (3/4) Epoch 7, batch 14650, loss[loss=0.2439, simple_loss=0.338, pruned_loss=0.07489, over 21281.00 frames. ], tot_loss[loss=0.2276, simple_loss=0.3045, pruned_loss=0.07533, over 4271286.51 frames. ], batch size: 548, lr: 4.32e-03, grad_scale: 16.0 2023-06-24 21:20:52,075 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.19 vs. limit=15.0 2023-06-24 21:21:03,328 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=1185762.0, ans=0.07 2023-06-24 21:21:10,316 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=1185822.0, ans=0.0 2023-06-24 21:21:28,237 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=11.37 vs. limit=15.0 2023-06-24 21:22:00,303 INFO [train.py:996] (3/4) Epoch 7, batch 14700, loss[loss=0.2358, simple_loss=0.3362, pruned_loss=0.06771, over 21673.00 frames. ], tot_loss[loss=0.2188, simple_loss=0.2978, pruned_loss=0.06987, over 4271391.66 frames. 
], batch size: 389, lr: 4.32e-03, grad_scale: 16.0 2023-06-24 21:22:58,951 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=13.08 vs. limit=22.5 2023-06-24 21:23:06,188 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.798e+02 2.369e+02 2.874e+02 3.417e+02 6.463e+02, threshold=5.748e+02, percent-clipped=0.0 2023-06-24 21:23:13,516 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.whiten.whitening_limit, batch_count=1186182.0, ans=12.0 2023-06-24 21:23:19,982 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=1186182.0, ans=0.0 2023-06-24 21:23:34,297 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1186242.0, ans=0.125 2023-06-24 21:23:51,805 INFO [train.py:996] (3/4) Epoch 7, batch 14750, loss[loss=0.1495, simple_loss=0.2068, pruned_loss=0.04609, over 16382.00 frames. ], tot_loss[loss=0.2239, simple_loss=0.3027, pruned_loss=0.07249, over 4262140.76 frames. ], batch size: 60, lr: 4.31e-03, grad_scale: 16.0 2023-06-24 21:24:10,283 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=1186302.0, ans=0.125 2023-06-24 21:24:36,582 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1186362.0, ans=0.125 2023-06-24 21:24:52,346 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1186422.0, ans=0.125 2023-06-24 21:25:07,040 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.17 vs. limit=15.0 2023-06-24 21:25:48,299 INFO [train.py:996] (3/4) Epoch 7, batch 14800, loss[loss=0.2754, simple_loss=0.3289, pruned_loss=0.1109, over 21302.00 frames. ], tot_loss[loss=0.2361, simple_loss=0.315, pruned_loss=0.07858, over 4260790.31 frames. ], batch size: 471, lr: 4.31e-03, grad_scale: 32.0 2023-06-24 21:26:21,128 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=1186662.0, ans=0.1 2023-06-24 21:26:29,256 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.24 vs. limit=22.5 2023-06-24 21:26:42,812 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=1186722.0, ans=0.125 2023-06-24 21:26:44,092 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.322e+02 3.322e+02 4.309e+02 5.612e+02 1.041e+03, threshold=8.619e+02, percent-clipped=22.0 2023-06-24 21:27:11,924 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.45 vs. limit=22.5 2023-06-24 21:27:44,128 INFO [train.py:996] (3/4) Epoch 7, batch 14850, loss[loss=0.2342, simple_loss=0.3054, pruned_loss=0.08148, over 21752.00 frames. ], tot_loss[loss=0.232, simple_loss=0.3082, pruned_loss=0.0779, over 4265186.41 frames. 
], batch size: 282, lr: 4.31e-03, grad_scale: 32.0 2023-06-24 21:27:46,531 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1186902.0, ans=0.0 2023-06-24 21:28:06,968 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=1186962.0, ans=0.0 2023-06-24 21:28:33,869 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1187022.0, ans=0.125 2023-06-24 21:29:03,915 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1187082.0, ans=0.0 2023-06-24 21:29:24,142 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.39 vs. limit=15.0 2023-06-24 21:29:30,031 INFO [train.py:996] (3/4) Epoch 7, batch 14900, loss[loss=0.3128, simple_loss=0.3661, pruned_loss=0.1298, over 21398.00 frames. ], tot_loss[loss=0.2353, simple_loss=0.3116, pruned_loss=0.07955, over 4265953.84 frames. ], batch size: 471, lr: 4.31e-03, grad_scale: 16.0 2023-06-24 21:29:37,492 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=1187202.0, ans=0.2 2023-06-24 21:29:52,978 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1187262.0, ans=0.125 2023-06-24 21:30:14,719 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=1187322.0, ans=0.2 2023-06-24 21:30:18,202 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1187322.0, ans=0.1 2023-06-24 21:30:36,825 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.129e+02 3.164e+02 3.884e+02 4.869e+02 8.267e+02, threshold=7.767e+02, percent-clipped=0.0 2023-06-24 21:30:37,404 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=1187382.0, ans=0.05 2023-06-24 21:30:46,731 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=1187382.0, ans=0.0 2023-06-24 21:31:01,146 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1187442.0, ans=0.1 2023-06-24 21:31:01,671 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.19 vs. limit=12.0 2023-06-24 21:31:20,157 INFO [train.py:996] (3/4) Epoch 7, batch 14950, loss[loss=0.1917, simple_loss=0.2854, pruned_loss=0.04901, over 21741.00 frames. ], tot_loss[loss=0.234, simple_loss=0.3112, pruned_loss=0.07841, over 4264204.23 frames. 
], batch size: 351, lr: 4.31e-03, grad_scale: 16.0 2023-06-24 21:32:04,217 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=1187622.0, ans=0.035 2023-06-24 21:32:19,593 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=1187622.0, ans=0.125 2023-06-24 21:32:27,965 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=1187622.0, ans=0.2 2023-06-24 21:33:09,299 INFO [train.py:996] (3/4) Epoch 7, batch 15000, loss[loss=0.2209, simple_loss=0.2908, pruned_loss=0.07556, over 21783.00 frames. ], tot_loss[loss=0.2369, simple_loss=0.3141, pruned_loss=0.07987, over 4261748.52 frames. ], batch size: 247, lr: 4.31e-03, grad_scale: 16.0 2023-06-24 21:33:09,300 INFO [train.py:1019] (3/4) Computing validation loss 2023-06-24 21:33:26,466 INFO [train.py:1028] (3/4) Epoch 7, validation: loss=0.2547, simple_loss=0.3504, pruned_loss=0.07951, over 1796401.00 frames. 2023-06-24 21:33:26,467 INFO [train.py:1029] (3/4) Maximum memory allocated so far is 23409MB 2023-06-24 21:34:25,598 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1187922.0, ans=0.125 2023-06-24 21:34:34,454 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-24 21:34:34,516 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=1187922.0, ans=0.025 2023-06-24 21:34:39,131 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.086e+02 2.767e+02 3.159e+02 3.696e+02 5.819e+02, threshold=6.318e+02, percent-clipped=0.0 2023-06-24 21:35:12,645 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1188042.0, ans=0.125 2023-06-24 21:35:17,606 INFO [train.py:996] (3/4) Epoch 7, batch 15050, loss[loss=0.2217, simple_loss=0.3075, pruned_loss=0.06792, over 21760.00 frames. ], tot_loss[loss=0.2391, simple_loss=0.3156, pruned_loss=0.08125, over 4265441.24 frames. ], batch size: 282, lr: 4.31e-03, grad_scale: 16.0 2023-06-24 21:35:35,865 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=1188102.0, ans=0.0 2023-06-24 21:36:00,225 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=1188162.0, ans=0.95 2023-06-24 21:36:19,835 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1188222.0, ans=0.1 2023-06-24 21:36:21,866 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.26 vs. limit=10.0 2023-06-24 21:36:21,887 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.74 vs. limit=22.5 2023-06-24 21:37:07,755 INFO [train.py:996] (3/4) Epoch 7, batch 15100, loss[loss=0.2234, simple_loss=0.2767, pruned_loss=0.08503, over 20194.00 frames. ], tot_loss[loss=0.2407, simple_loss=0.3182, pruned_loss=0.08156, over 4264483.04 frames. 
], batch size: 703, lr: 4.31e-03, grad_scale: 16.0 2023-06-24 21:37:44,710 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=1188462.0, ans=0.05 2023-06-24 21:37:44,780 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1188462.0, ans=0.125 2023-06-24 21:37:51,902 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1188462.0, ans=0.125 2023-06-24 21:38:00,121 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=1188522.0, ans=0.125 2023-06-24 21:38:07,424 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=1188522.0, ans=0.125 2023-06-24 21:38:09,169 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1188522.0, ans=0.125 2023-06-24 21:38:13,587 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.271e+02 2.973e+02 3.589e+02 4.717e+02 7.835e+02, threshold=7.177e+02, percent-clipped=5.0 2023-06-24 21:38:17,842 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1188582.0, ans=0.0 2023-06-24 21:38:38,125 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1188642.0, ans=0.1 2023-06-24 21:38:59,005 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1188702.0, ans=0.1 2023-06-24 21:39:00,111 INFO [train.py:996] (3/4) Epoch 7, batch 15150, loss[loss=0.2262, simple_loss=0.2889, pruned_loss=0.08175, over 21815.00 frames. ], tot_loss[loss=0.2387, simple_loss=0.3142, pruned_loss=0.08166, over 4272441.86 frames. ], batch size: 98, lr: 4.31e-03, grad_scale: 16.0 2023-06-24 21:39:25,586 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1188702.0, ans=0.125 2023-06-24 21:39:51,837 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1188822.0, ans=0.1 2023-06-24 21:39:57,002 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=1188822.0, ans=0.0 2023-06-24 21:40:07,933 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1188882.0, ans=0.0 2023-06-24 21:40:49,664 INFO [train.py:996] (3/4) Epoch 7, batch 15200, loss[loss=0.1932, simple_loss=0.2814, pruned_loss=0.05245, over 21700.00 frames. ], tot_loss[loss=0.2289, simple_loss=0.3048, pruned_loss=0.07653, over 4272013.84 frames. ], batch size: 351, lr: 4.31e-03, grad_scale: 32.0 2023-06-24 21:41:16,721 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1189062.0, ans=0.125 2023-06-24 21:41:51,477 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.032e+02 2.555e+02 2.882e+02 3.442e+02 5.882e+02, threshold=5.763e+02, percent-clipped=0.0 2023-06-24 21:42:31,273 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=11.92 vs. 
limit=15.0 2023-06-24 21:42:49,798 INFO [train.py:996] (3/4) Epoch 7, batch 15250, loss[loss=0.1877, simple_loss=0.2574, pruned_loss=0.05901, over 21668.00 frames. ], tot_loss[loss=0.2243, simple_loss=0.2986, pruned_loss=0.07504, over 4264985.98 frames. ], batch size: 282, lr: 4.31e-03, grad_scale: 32.0 2023-06-24 21:42:59,042 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=1189302.0, ans=0.125 2023-06-24 21:43:06,157 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=1189362.0, ans=0.125 2023-06-24 21:43:36,595 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1189422.0, ans=0.1 2023-06-24 21:43:38,155 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1189422.0, ans=0.125 2023-06-24 21:44:40,022 INFO [train.py:996] (3/4) Epoch 7, batch 15300, loss[loss=0.292, simple_loss=0.3475, pruned_loss=0.1182, over 21591.00 frames. ], tot_loss[loss=0.2277, simple_loss=0.3003, pruned_loss=0.07756, over 4271626.84 frames. ], batch size: 415, lr: 4.31e-03, grad_scale: 16.0 2023-06-24 21:44:49,486 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.20 vs. limit=15.0 2023-06-24 21:45:01,591 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=1189662.0, ans=0.0 2023-06-24 21:45:02,056 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.20 vs. limit=10.0 2023-06-24 21:45:11,960 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=1189662.0, ans=0.2 2023-06-24 21:45:15,524 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1189722.0, ans=0.125 2023-06-24 21:45:17,152 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1189722.0, ans=0.0 2023-06-24 21:45:30,991 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1189782.0, ans=0.0 2023-06-24 21:45:37,480 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.380e+02 3.236e+02 3.827e+02 4.813e+02 8.149e+02, threshold=7.653e+02, percent-clipped=14.0 2023-06-24 21:46:11,780 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=12.36 vs. limit=15.0 2023-06-24 21:46:27,867 INFO [train.py:996] (3/4) Epoch 7, batch 15350, loss[loss=0.266, simple_loss=0.3339, pruned_loss=0.09905, over 21276.00 frames. ], tot_loss[loss=0.2331, simple_loss=0.306, pruned_loss=0.08012, over 4267054.25 frames. ], batch size: 143, lr: 4.31e-03, grad_scale: 16.0 2023-06-24 21:46:28,562 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1189902.0, ans=0.125 2023-06-24 21:48:14,115 INFO [train.py:996] (3/4) Epoch 7, batch 15400, loss[loss=0.2004, simple_loss=0.3004, pruned_loss=0.05021, over 21773.00 frames. ], tot_loss[loss=0.2306, simple_loss=0.3058, pruned_loss=0.07769, over 4262179.72 frames. 
], batch size: 298, lr: 4.31e-03, grad_scale: 16.0 2023-06-24 21:48:32,401 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.57 vs. limit=22.5 2023-06-24 21:48:39,469 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1190262.0, ans=0.0 2023-06-24 21:48:58,972 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=1190322.0, ans=0.035 2023-06-24 21:49:05,378 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.161e+02 2.624e+02 3.015e+02 3.662e+02 6.507e+02, threshold=6.030e+02, percent-clipped=0.0 2023-06-24 21:49:22,146 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=1190442.0, ans=0.025 2023-06-24 21:49:22,184 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1190442.0, ans=0.1 2023-06-24 21:50:02,499 INFO [train.py:996] (3/4) Epoch 7, batch 15450, loss[loss=0.1945, simple_loss=0.2799, pruned_loss=0.05451, over 21520.00 frames. ], tot_loss[loss=0.2283, simple_loss=0.3037, pruned_loss=0.07648, over 4267706.49 frames. ], batch size: 131, lr: 4.31e-03, grad_scale: 16.0 2023-06-24 21:50:03,635 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=7.64 vs. limit=15.0 2023-06-24 21:50:11,143 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1190502.0, ans=0.0 2023-06-24 21:50:45,025 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1190622.0, ans=0.125 2023-06-24 21:50:57,617 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.53 vs. limit=15.0 2023-06-24 21:51:52,838 INFO [train.py:996] (3/4) Epoch 7, batch 15500, loss[loss=0.2449, simple_loss=0.325, pruned_loss=0.08233, over 21873.00 frames. ], tot_loss[loss=0.2303, simple_loss=0.3069, pruned_loss=0.07688, over 4272067.07 frames. ], batch size: 371, lr: 4.31e-03, grad_scale: 16.0 2023-06-24 21:51:54,938 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1190802.0, ans=0.125 2023-06-24 21:52:11,164 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1190862.0, ans=0.0 2023-06-24 21:52:16,811 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.33 vs. limit=12.0 2023-06-24 21:52:24,391 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.28 vs. 
limit=15.0 2023-06-24 21:52:29,441 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1190922.0, ans=0.125 2023-06-24 21:52:38,281 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=1190922.0, ans=0.0 2023-06-24 21:52:51,802 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.076e+02 2.878e+02 3.263e+02 4.056e+02 7.756e+02, threshold=6.526e+02, percent-clipped=2.0 2023-06-24 21:52:56,017 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=1190982.0, ans=0.0 2023-06-24 21:53:00,167 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.29 vs. limit=15.0 2023-06-24 21:53:22,024 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1191042.0, ans=0.0 2023-06-24 21:53:34,827 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.58 vs. limit=15.0 2023-06-24 21:53:37,042 INFO [train.py:996] (3/4) Epoch 7, batch 15550, loss[loss=0.2163, simple_loss=0.3045, pruned_loss=0.06405, over 21612.00 frames. ], tot_loss[loss=0.2273, simple_loss=0.3057, pruned_loss=0.0744, over 4263937.72 frames. ], batch size: 441, lr: 4.31e-03, grad_scale: 16.0 2023-06-24 21:55:08,790 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1191342.0, ans=0.125 2023-06-24 21:55:20,512 INFO [train.py:996] (3/4) Epoch 7, batch 15600, loss[loss=0.248, simple_loss=0.2892, pruned_loss=0.1034, over 21352.00 frames. ], tot_loss[loss=0.2222, simple_loss=0.2991, pruned_loss=0.07261, over 4258340.58 frames. ], batch size: 507, lr: 4.31e-03, grad_scale: 32.0 2023-06-24 21:55:26,786 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1191402.0, ans=0.125 2023-06-24 21:55:29,270 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=12.88 vs. limit=15.0 2023-06-24 21:55:35,890 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.53 vs. limit=15.0 2023-06-24 21:56:23,745 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.070e+02 2.726e+02 3.210e+02 4.134e+02 7.598e+02, threshold=6.420e+02, percent-clipped=3.0 2023-06-24 21:56:26,044 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=1191582.0, ans=0.2 2023-06-24 21:56:33,133 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=1191582.0, ans=0.2 2023-06-24 21:57:09,395 INFO [train.py:996] (3/4) Epoch 7, batch 15650, loss[loss=0.2257, simple_loss=0.2912, pruned_loss=0.08014, over 21502.00 frames. ], tot_loss[loss=0.2207, simple_loss=0.2974, pruned_loss=0.07196, over 4261635.46 frames. 
], batch size: 195, lr: 4.30e-03, grad_scale: 32.0 2023-06-24 21:57:22,093 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=1191702.0, ans=0.0 2023-06-24 21:57:39,812 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1191762.0, ans=0.1 2023-06-24 21:58:13,528 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.76 vs. limit=10.0 2023-06-24 21:58:48,471 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1191942.0, ans=0.125 2023-06-24 21:58:57,013 INFO [train.py:996] (3/4) Epoch 7, batch 15700, loss[loss=0.1881, simple_loss=0.2521, pruned_loss=0.06211, over 15449.00 frames. ], tot_loss[loss=0.2182, simple_loss=0.294, pruned_loss=0.07121, over 4263604.50 frames. ], batch size: 60, lr: 4.30e-03, grad_scale: 32.0 2023-06-24 22:00:00,280 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.170e+02 2.614e+02 3.168e+02 3.646e+02 5.632e+02, threshold=6.336e+02, percent-clipped=0.0 2023-06-24 22:00:18,311 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1192182.0, ans=0.1 2023-06-24 22:00:43,466 INFO [train.py:996] (3/4) Epoch 7, batch 15750, loss[loss=0.2699, simple_loss=0.3186, pruned_loss=0.1107, over 21411.00 frames. ], tot_loss[loss=0.2158, simple_loss=0.2892, pruned_loss=0.0712, over 4260351.24 frames. ], batch size: 508, lr: 4.30e-03, grad_scale: 16.0 2023-06-24 22:01:02,648 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=1192362.0, ans=0.0 2023-06-24 22:01:06,258 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=1192362.0, ans=0.015 2023-06-24 22:01:27,805 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=1192422.0, ans=0.125 2023-06-24 22:02:32,330 INFO [train.py:996] (3/4) Epoch 7, batch 15800, loss[loss=0.1822, simple_loss=0.2427, pruned_loss=0.0608, over 21478.00 frames. ], tot_loss[loss=0.2132, simple_loss=0.2841, pruned_loss=0.07116, over 4261644.11 frames. ], batch size: 230, lr: 4.30e-03, grad_scale: 16.0 2023-06-24 22:02:52,101 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=1192662.0, ans=0.04949747468305833 2023-06-24 22:03:02,461 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=1192662.0, ans=0.0 2023-06-24 22:03:37,341 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.306e+02 2.697e+02 3.086e+02 3.699e+02 6.270e+02, threshold=6.172e+02, percent-clipped=0.0 2023-06-24 22:03:45,651 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1192782.0, ans=0.0 2023-06-24 22:03:49,243 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=1192782.0, ans=0.025 2023-06-24 22:04:15,586 INFO [train.py:996] (3/4) Epoch 7, batch 15850, loss[loss=0.1945, simple_loss=0.2682, pruned_loss=0.06041, over 21875.00 frames. 
], tot_loss[loss=0.2166, simple_loss=0.2869, pruned_loss=0.07317, over 4260054.33 frames. ], batch size: 317, lr: 4.30e-03, grad_scale: 16.0 2023-06-24 22:05:56,013 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=11.93 vs. limit=15.0 2023-06-24 22:06:01,259 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=1193142.0, ans=0.0 2023-06-24 22:06:04,578 INFO [train.py:996] (3/4) Epoch 7, batch 15900, loss[loss=0.1914, simple_loss=0.2568, pruned_loss=0.06304, over 21779.00 frames. ], tot_loss[loss=0.2154, simple_loss=0.2851, pruned_loss=0.07286, over 4259604.74 frames. ], batch size: 317, lr: 4.30e-03, grad_scale: 16.0 2023-06-24 22:06:11,915 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-24 22:06:19,108 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=1193202.0, ans=0.2 2023-06-24 22:06:38,555 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.80 vs. limit=15.0 2023-06-24 22:06:54,264 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1193322.0, ans=0.125 2023-06-24 22:06:54,294 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1193322.0, ans=0.125 2023-06-24 22:07:04,564 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1193322.0, ans=0.125 2023-06-24 22:07:08,352 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=1193382.0, ans=0.125 2023-06-24 22:07:09,486 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.367e+02 2.996e+02 3.520e+02 4.315e+02 6.246e+02, threshold=7.040e+02, percent-clipped=3.0 2023-06-24 22:07:26,246 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1193382.0, ans=0.125 2023-06-24 22:07:53,076 INFO [train.py:996] (3/4) Epoch 7, batch 15950, loss[loss=0.2047, simple_loss=0.2795, pruned_loss=0.06492, over 21663.00 frames. ], tot_loss[loss=0.2153, simple_loss=0.2878, pruned_loss=0.07144, over 4265758.02 frames. ], batch size: 263, lr: 4.30e-03, grad_scale: 16.0 2023-06-24 22:08:08,175 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=11.90 vs. limit=15.0 2023-06-24 22:08:24,137 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.63 vs. 
limit=12.0 2023-06-24 22:08:25,546 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1193562.0, ans=0.125 2023-06-24 22:08:53,821 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=1193622.0, ans=0.07 2023-06-24 22:08:55,394 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1193682.0, ans=0.125 2023-06-24 22:08:57,341 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1193682.0, ans=0.125 2023-06-24 22:09:18,813 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=1193682.0, ans=0.0 2023-06-24 22:09:42,999 INFO [train.py:996] (3/4) Epoch 7, batch 16000, loss[loss=0.2069, simple_loss=0.311, pruned_loss=0.05143, over 21646.00 frames. ], tot_loss[loss=0.2146, simple_loss=0.2896, pruned_loss=0.06986, over 4266585.55 frames. ], batch size: 389, lr: 4.30e-03, grad_scale: 32.0 2023-06-24 22:10:55,221 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.046e+02 3.010e+02 3.950e+02 5.010e+02 9.750e+02, threshold=7.899e+02, percent-clipped=10.0 2023-06-24 22:11:32,510 INFO [train.py:996] (3/4) Epoch 7, batch 16050, loss[loss=0.1971, simple_loss=0.3048, pruned_loss=0.04467, over 20804.00 frames. ], tot_loss[loss=0.2144, simple_loss=0.2929, pruned_loss=0.06792, over 4262313.25 frames. ], batch size: 608, lr: 4.30e-03, grad_scale: 16.0 2023-06-24 22:11:39,869 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=1194102.0, ans=0.07 2023-06-24 22:11:58,427 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=1194162.0, ans=0.0 2023-06-24 22:12:53,571 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.03 vs. limit=15.0 2023-06-24 22:13:10,528 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1194342.0, ans=0.125 2023-06-24 22:13:20,134 INFO [train.py:996] (3/4) Epoch 7, batch 16100, loss[loss=0.218, simple_loss=0.2897, pruned_loss=0.07315, over 21201.00 frames. ], tot_loss[loss=0.2176, simple_loss=0.2973, pruned_loss=0.06899, over 4268597.74 frames. ], batch size: 143, lr: 4.30e-03, grad_scale: 16.0 2023-06-24 22:13:27,989 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=7.24 vs. 
limit=15.0 2023-06-24 22:13:56,540 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=1194522.0, ans=0.125 2023-06-24 22:14:25,470 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.126e+02 3.065e+02 3.753e+02 4.772e+02 1.110e+03, threshold=7.506e+02, percent-clipped=5.0 2023-06-24 22:14:31,127 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1194582.0, ans=0.125 2023-06-24 22:14:53,687 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=1194642.0, ans=0.0 2023-06-24 22:15:06,571 INFO [train.py:996] (3/4) Epoch 7, batch 16150, loss[loss=0.2153, simple_loss=0.2813, pruned_loss=0.07462, over 21925.00 frames. ], tot_loss[loss=0.2196, simple_loss=0.2977, pruned_loss=0.07076, over 4271426.95 frames. ], batch size: 316, lr: 4.30e-03, grad_scale: 16.0 2023-06-24 22:15:10,780 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1194702.0, ans=0.125 2023-06-24 22:15:52,197 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1194822.0, ans=0.1 2023-06-24 22:15:55,332 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=1194822.0, ans=0.125 2023-06-24 22:16:09,422 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1194882.0, ans=0.125 2023-06-24 22:16:57,001 INFO [train.py:996] (3/4) Epoch 7, batch 16200, loss[loss=0.2577, simple_loss=0.3381, pruned_loss=0.08864, over 21410.00 frames. ], tot_loss[loss=0.223, simple_loss=0.302, pruned_loss=0.072, over 4275950.33 frames. ], batch size: 131, lr: 4.30e-03, grad_scale: 16.0 2023-06-24 22:17:03,074 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1195002.0, ans=0.0 2023-06-24 22:17:06,685 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1195002.0, ans=0.0 2023-06-24 22:17:20,894 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1195062.0, ans=0.0 2023-06-24 22:18:15,479 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.476e+02 2.895e+02 3.394e+02 4.172e+02 8.958e+02, threshold=6.788e+02, percent-clipped=2.0 2023-06-24 22:18:25,434 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=1195182.0, ans=10.0 2023-06-24 22:18:46,189 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1195302.0, ans=0.1 2023-06-24 22:18:47,723 INFO [train.py:996] (3/4) Epoch 7, batch 16250, loss[loss=0.1875, simple_loss=0.2611, pruned_loss=0.05695, over 21727.00 frames. ], tot_loss[loss=0.2244, simple_loss=0.3023, pruned_loss=0.07322, over 4277688.53 frames. 
], batch size: 124, lr: 4.30e-03, grad_scale: 16.0 2023-06-24 22:18:53,876 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten.whitening_limit, batch_count=1195302.0, ans=15.0 2023-06-24 22:19:30,544 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1195422.0, ans=0.125 2023-06-24 22:19:51,296 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1195422.0, ans=0.1 2023-06-24 22:20:12,093 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1195542.0, ans=0.125 2023-06-24 22:20:19,107 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1195542.0, ans=0.125 2023-06-24 22:20:31,168 INFO [train.py:996] (3/4) Epoch 7, batch 16300, loss[loss=0.178, simple_loss=0.2711, pruned_loss=0.04239, over 21655.00 frames. ], tot_loss[loss=0.2186, simple_loss=0.2968, pruned_loss=0.07017, over 4269173.52 frames. ], batch size: 247, lr: 4.30e-03, grad_scale: 16.0 2023-06-24 22:20:57,816 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1195662.0, ans=0.125 2023-06-24 22:21:06,955 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=1195662.0, ans=0.0 2023-06-24 22:21:07,631 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.13 vs. limit=22.5 2023-06-24 22:21:48,946 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.184e+02 2.667e+02 3.225e+02 3.668e+02 6.965e+02, threshold=6.450e+02, percent-clipped=1.0 2023-06-24 22:22:09,377 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=10.95 vs. limit=15.0 2023-06-24 22:22:20,750 INFO [train.py:996] (3/4) Epoch 7, batch 16350, loss[loss=0.1979, simple_loss=0.272, pruned_loss=0.0619, over 21613.00 frames. ], tot_loss[loss=0.2185, simple_loss=0.2964, pruned_loss=0.07029, over 4270295.74 frames. ], batch size: 298, lr: 4.30e-03, grad_scale: 16.0 2023-06-24 22:22:28,609 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1195902.0, ans=0.0 2023-06-24 22:24:04,362 INFO [train.py:996] (3/4) Epoch 7, batch 16400, loss[loss=0.2475, simple_loss=0.3172, pruned_loss=0.08894, over 21779.00 frames. ], tot_loss[loss=0.2218, simple_loss=0.2995, pruned_loss=0.07206, over 4278197.99 frames. ], batch size: 441, lr: 4.30e-03, grad_scale: 32.0 2023-06-24 22:24:07,691 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.68 vs. limit=6.0 2023-06-24 22:24:19,576 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1196202.0, ans=0.125 2023-06-24 22:25:16,397 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.060e+02 2.934e+02 3.396e+02 4.473e+02 6.388e+02, threshold=6.793e+02, percent-clipped=0.0 2023-06-24 22:25:48,958 INFO [train.py:996] (3/4) Epoch 7, batch 16450, loss[loss=0.2121, simple_loss=0.2914, pruned_loss=0.06634, over 21835.00 frames. 
], tot_loss[loss=0.2225, simple_loss=0.2992, pruned_loss=0.07291, over 4282900.53 frames. ], batch size: 124, lr: 4.30e-03, grad_scale: 32.0 2023-06-24 22:26:24,805 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.35 vs. limit=15.0 2023-06-24 22:26:52,450 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1196622.0, ans=0.125 2023-06-24 22:27:03,188 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=1196682.0, ans=0.0 2023-06-24 22:27:10,193 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1196682.0, ans=0.125 2023-06-24 22:27:32,607 INFO [train.py:996] (3/4) Epoch 7, batch 16500, loss[loss=0.1688, simple_loss=0.2439, pruned_loss=0.04683, over 21640.00 frames. ], tot_loss[loss=0.2204, simple_loss=0.2961, pruned_loss=0.07236, over 4281332.00 frames. ], batch size: 230, lr: 4.30e-03, grad_scale: 32.0 2023-06-24 22:28:36,142 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=9.51 vs. limit=15.0 2023-06-24 22:28:51,337 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.322e+02 3.249e+02 4.017e+02 5.671e+02 1.121e+03, threshold=8.034e+02, percent-clipped=17.0 2023-06-24 22:28:55,499 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=1196982.0, ans=0.2 2023-06-24 22:28:58,879 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.min_positive, batch_count=1196982.0, ans=0.025 2023-06-24 22:29:17,056 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1197042.0, ans=0.0 2023-06-24 22:29:23,154 INFO [train.py:996] (3/4) Epoch 7, batch 16550, loss[loss=0.2233, simple_loss=0.3089, pruned_loss=0.06886, over 21853.00 frames. ], tot_loss[loss=0.2178, simple_loss=0.2949, pruned_loss=0.07034, over 4283501.64 frames. ], batch size: 371, lr: 4.30e-03, grad_scale: 32.0 2023-06-24 22:29:39,079 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=1197102.0, ans=0.0 2023-06-24 22:29:41,091 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1197102.0, ans=0.0 2023-06-24 22:29:56,517 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=1197102.0, ans=0.0 2023-06-24 22:29:58,922 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.74 vs. limit=6.0 2023-06-24 22:30:02,009 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=1197162.0, ans=0.0 2023-06-24 22:30:25,294 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.30 vs. limit=22.5 2023-06-24 22:30:26,798 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.07 vs. 
limit=22.5 2023-06-24 22:30:35,881 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1197222.0, ans=0.0 2023-06-24 22:30:44,618 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=1197282.0, ans=0.125 2023-06-24 22:31:37,494 INFO [train.py:996] (3/4) Epoch 7, batch 16600, loss[loss=0.2574, simple_loss=0.3398, pruned_loss=0.08752, over 21775.00 frames. ], tot_loss[loss=0.2259, simple_loss=0.3039, pruned_loss=0.07392, over 4284230.60 frames. ], batch size: 124, lr: 4.29e-03, grad_scale: 16.0 2023-06-24 22:32:16,137 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1197522.0, ans=0.125 2023-06-24 22:32:21,305 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1197522.0, ans=0.0 2023-06-24 22:32:28,908 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=1197522.0, ans=0.0 2023-06-24 22:32:36,774 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.639e+02 3.261e+02 4.003e+02 5.335e+02 1.096e+03, threshold=8.006e+02, percent-clipped=4.0 2023-06-24 22:33:08,251 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=11.86 vs. limit=15.0 2023-06-24 22:33:15,197 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1197642.0, ans=0.125 2023-06-24 22:33:29,479 INFO [train.py:996] (3/4) Epoch 7, batch 16650, loss[loss=0.2819, simple_loss=0.3522, pruned_loss=0.1058, over 21795.00 frames. ], tot_loss[loss=0.2317, simple_loss=0.3124, pruned_loss=0.07555, over 4278228.60 frames. ], batch size: 441, lr: 4.29e-03, grad_scale: 16.0 2023-06-24 22:33:31,696 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=1197702.0, ans=0.125 2023-06-24 22:33:56,484 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten.whitening_limit, batch_count=1197762.0, ans=22.5 2023-06-24 22:34:27,892 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1197822.0, ans=0.125 2023-06-24 22:35:17,466 INFO [train.py:996] (3/4) Epoch 7, batch 16700, loss[loss=0.2115, simple_loss=0.2859, pruned_loss=0.06855, over 21674.00 frames. ], tot_loss[loss=0.2334, simple_loss=0.3134, pruned_loss=0.0767, over 4272784.12 frames. ], batch size: 298, lr: 4.29e-03, grad_scale: 16.0 2023-06-24 22:36:39,629 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.570e+02 3.449e+02 4.344e+02 5.804e+02 8.392e+02, threshold=8.689e+02, percent-clipped=2.0 2023-06-24 22:37:03,505 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1198242.0, ans=0.0 2023-06-24 22:37:12,547 INFO [train.py:996] (3/4) Epoch 7, batch 16750, loss[loss=0.2784, simple_loss=0.3701, pruned_loss=0.09334, over 21571.00 frames. ], tot_loss[loss=0.2361, simple_loss=0.3154, pruned_loss=0.07841, over 4272275.74 frames. 
], batch size: 414, lr: 4.29e-03, grad_scale: 16.0 2023-06-24 22:37:28,064 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1198302.0, ans=0.125 2023-06-24 22:38:58,436 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1198542.0, ans=0.1 2023-06-24 22:39:02,800 INFO [train.py:996] (3/4) Epoch 7, batch 16800, loss[loss=0.2187, simple_loss=0.2908, pruned_loss=0.07335, over 21864.00 frames. ], tot_loss[loss=0.2386, simple_loss=0.3194, pruned_loss=0.07886, over 4273684.06 frames. ], batch size: 107, lr: 4.29e-03, grad_scale: 32.0 2023-06-24 22:39:24,265 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=1198602.0, ans=0.0 2023-06-24 22:39:55,346 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1198662.0, ans=0.125 2023-06-24 22:40:20,487 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.624e+02 3.537e+02 4.384e+02 6.125e+02 1.119e+03, threshold=8.769e+02, percent-clipped=3.0 2023-06-24 22:40:55,216 INFO [train.py:996] (3/4) Epoch 7, batch 16850, loss[loss=0.1974, simple_loss=0.2694, pruned_loss=0.06265, over 21928.00 frames. ], tot_loss[loss=0.2372, simple_loss=0.3165, pruned_loss=0.07896, over 4279371.73 frames. ], batch size: 316, lr: 4.29e-03, grad_scale: 32.0 2023-06-24 22:41:37,061 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.93 vs. limit=15.0 2023-06-24 22:41:55,817 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=1199022.0, ans=0.5 2023-06-24 22:42:06,310 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=1199082.0, ans=0.125 2023-06-24 22:42:47,358 INFO [train.py:996] (3/4) Epoch 7, batch 16900, loss[loss=0.1951, simple_loss=0.2703, pruned_loss=0.05995, over 21654.00 frames. ], tot_loss[loss=0.2332, simple_loss=0.3111, pruned_loss=0.07761, over 4280297.90 frames. ], batch size: 332, lr: 4.29e-03, grad_scale: 32.0 2023-06-24 22:43:07,751 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=16.85 vs. limit=22.5 2023-06-24 22:43:09,374 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.50 vs. 
limit=15.0 2023-06-24 22:43:24,152 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=1199262.0, ans=0.2 2023-06-24 22:43:43,113 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1199322.0, ans=0.0 2023-06-24 22:43:54,624 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.184e+02 2.672e+02 3.013e+02 3.696e+02 7.423e+02, threshold=6.025e+02, percent-clipped=0.0 2023-06-24 22:44:08,918 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1199442.0, ans=0.1 2023-06-24 22:44:15,698 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=1199442.0, ans=0.125 2023-06-24 22:44:34,204 INFO [train.py:996] (3/4) Epoch 7, batch 16950, loss[loss=0.2183, simple_loss=0.2961, pruned_loss=0.07025, over 21855.00 frames. ], tot_loss[loss=0.2278, simple_loss=0.3035, pruned_loss=0.07602, over 4284965.17 frames. ], batch size: 118, lr: 4.29e-03, grad_scale: 32.0 2023-06-24 22:44:48,644 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1199502.0, ans=0.1 2023-06-24 22:45:47,068 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1199682.0, ans=0.1 2023-06-24 22:46:21,573 INFO [train.py:996] (3/4) Epoch 7, batch 17000, loss[loss=0.2364, simple_loss=0.305, pruned_loss=0.08391, over 21358.00 frames. ], tot_loss[loss=0.2266, simple_loss=0.2999, pruned_loss=0.07662, over 4285214.85 frames. ], batch size: 159, lr: 4.29e-03, grad_scale: 32.0 2023-06-24 22:46:38,514 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1199802.0, ans=0.1 2023-06-24 22:46:56,942 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1199862.0, ans=0.1 2023-06-24 22:47:04,528 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=7.84 vs. limit=15.0 2023-06-24 22:47:33,175 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.260e+02 3.127e+02 3.708e+02 4.467e+02 7.774e+02, threshold=7.417e+02, percent-clipped=6.0 2023-06-24 22:48:18,530 INFO [train.py:996] (3/4) Epoch 7, batch 17050, loss[loss=0.2411, simple_loss=0.3249, pruned_loss=0.07863, over 21859.00 frames. ], tot_loss[loss=0.2305, simple_loss=0.3042, pruned_loss=0.0784, over 4284513.45 frames. ], batch size: 351, lr: 4.29e-03, grad_scale: 32.0 2023-06-24 22:48:24,839 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.45 vs. limit=6.0 2023-06-24 22:48:40,106 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-24 22:48:52,907 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.16 vs. limit=6.0 2023-06-24 22:49:06,807 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.39 vs. 
limit=15.0 2023-06-24 22:50:04,878 INFO [train.py:996] (3/4) Epoch 7, batch 17100, loss[loss=0.2351, simple_loss=0.3024, pruned_loss=0.08391, over 21837.00 frames. ], tot_loss[loss=0.231, simple_loss=0.304, pruned_loss=0.079, over 4283452.69 frames. ], batch size: 124, lr: 4.29e-03, grad_scale: 32.0 2023-06-24 22:50:06,943 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1200402.0, ans=0.125 2023-06-24 22:50:46,665 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1200522.0, ans=0.1 2023-06-24 22:50:52,179 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=1200522.0, ans=0.0 2023-06-24 22:51:07,026 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.328e+02 2.876e+02 3.458e+02 4.009e+02 6.895e+02, threshold=6.917e+02, percent-clipped=0.0 2023-06-24 22:51:21,504 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1200642.0, ans=0.125 2023-06-24 22:51:31,889 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1200642.0, ans=0.1 2023-06-24 22:51:46,949 INFO [train.py:996] (3/4) Epoch 7, batch 17150, loss[loss=0.1723, simple_loss=0.2594, pruned_loss=0.04266, over 21759.00 frames. ], tot_loss[loss=0.2282, simple_loss=0.3, pruned_loss=0.07819, over 4290164.09 frames. ], batch size: 247, lr: 4.29e-03, grad_scale: 32.0 2023-06-24 22:52:34,105 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.33 vs. limit=22.5 2023-06-24 22:52:34,162 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=6.09 vs. limit=10.0 2023-06-24 22:52:36,864 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=1200822.0, ans=0.2 2023-06-24 22:52:40,571 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1200822.0, ans=0.0 2023-06-24 22:53:42,014 INFO [train.py:996] (3/4) Epoch 7, batch 17200, loss[loss=0.2087, simple_loss=0.2896, pruned_loss=0.06389, over 20738.00 frames. ], tot_loss[loss=0.2279, simple_loss=0.3004, pruned_loss=0.07774, over 4291445.90 frames. ], batch size: 608, lr: 4.29e-03, grad_scale: 32.0 2023-06-24 22:54:42,275 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=11.02 vs. limit=22.5 2023-06-24 22:54:53,176 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.136e+02 2.812e+02 3.269e+02 4.158e+02 6.698e+02, threshold=6.538e+02, percent-clipped=0.0 2023-06-24 22:55:33,463 INFO [train.py:996] (3/4) Epoch 7, batch 17250, loss[loss=0.264, simple_loss=0.3537, pruned_loss=0.08718, over 21520.00 frames. ], tot_loss[loss=0.2304, simple_loss=0.3035, pruned_loss=0.0786, over 4279241.14 frames. ], batch size: 131, lr: 4.29e-03, grad_scale: 32.0 2023-06-24 22:57:24,156 INFO [train.py:996] (3/4) Epoch 7, batch 17300, loss[loss=0.2668, simple_loss=0.341, pruned_loss=0.09626, over 21607.00 frames. ], tot_loss[loss=0.2373, simple_loss=0.3121, pruned_loss=0.08122, over 4276572.10 frames. 
], batch size: 389, lr: 4.29e-03, grad_scale: 16.0 2023-06-24 22:58:02,534 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=1201662.0, ans=0.125 2023-06-24 22:58:47,368 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.589e+02 3.154e+02 3.783e+02 4.784e+02 7.470e+02, threshold=7.566e+02, percent-clipped=5.0 2023-06-24 22:59:15,033 INFO [train.py:996] (3/4) Epoch 7, batch 17350, loss[loss=0.2024, simple_loss=0.2903, pruned_loss=0.05727, over 21873.00 frames. ], tot_loss[loss=0.2377, simple_loss=0.3132, pruned_loss=0.08111, over 4273973.40 frames. ], batch size: 316, lr: 4.29e-03, grad_scale: 16.0 2023-06-24 22:59:35,320 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=1201902.0, ans=0.0 2023-06-24 23:00:36,897 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1202082.0, ans=0.0 2023-06-24 23:00:54,979 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1202142.0, ans=0.125 2023-06-24 23:01:03,727 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=1202202.0, ans=0.07 2023-06-24 23:01:04,932 INFO [train.py:996] (3/4) Epoch 7, batch 17400, loss[loss=0.1992, simple_loss=0.2782, pruned_loss=0.06013, over 21592.00 frames. ], tot_loss[loss=0.2325, simple_loss=0.3095, pruned_loss=0.07772, over 4274663.62 frames. ], batch size: 263, lr: 4.29e-03, grad_scale: 16.0 2023-06-24 23:01:07,332 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1202202.0, ans=0.125 2023-06-24 23:02:28,350 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.135e+02 3.058e+02 3.682e+02 4.915e+02 8.567e+02, threshold=7.364e+02, percent-clipped=2.0 2023-06-24 23:02:43,169 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=1202442.0, ans=0.04949747468305833 2023-06-24 23:02:45,358 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1202442.0, ans=0.1 2023-06-24 23:03:05,880 INFO [train.py:996] (3/4) Epoch 7, batch 17450, loss[loss=0.1689, simple_loss=0.2476, pruned_loss=0.0451, over 21174.00 frames. ], tot_loss[loss=0.2301, simple_loss=0.3084, pruned_loss=0.07591, over 4272900.09 frames. ], batch size: 176, lr: 4.29e-03, grad_scale: 16.0 2023-06-24 23:03:30,612 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1202562.0, ans=0.125 2023-06-24 23:04:31,589 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=1202742.0, ans=0.0 2023-06-24 23:04:35,112 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=1202742.0, ans=0.04949747468305833 2023-06-24 23:04:56,132 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=1202742.0, ans=0.2 2023-06-24 23:04:59,038 INFO [train.py:996] (3/4) Epoch 7, batch 17500, loss[loss=0.2642, simple_loss=0.3188, pruned_loss=0.1048, over 21650.00 frames. ], tot_loss[loss=0.2255, simple_loss=0.3037, pruned_loss=0.07359, over 4271383.38 frames. 
], batch size: 471, lr: 4.29e-03, grad_scale: 16.0 2023-06-24 23:05:14,863 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1202862.0, ans=0.125 2023-06-24 23:05:46,851 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1202922.0, ans=0.125 2023-06-24 23:06:04,513 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.111e+02 2.853e+02 3.403e+02 4.672e+02 8.323e+02, threshold=6.806e+02, percent-clipped=1.0 2023-06-24 23:06:19,041 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1203042.0, ans=0.0 2023-06-24 23:06:22,715 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1203042.0, ans=0.0 2023-06-24 23:06:35,827 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=10.25 vs. limit=15.0 2023-06-24 23:06:44,180 INFO [train.py:996] (3/4) Epoch 7, batch 17550, loss[loss=0.2125, simple_loss=0.3047, pruned_loss=0.06012, over 21644.00 frames. ], tot_loss[loss=0.2232, simple_loss=0.3025, pruned_loss=0.07195, over 4269369.55 frames. ], batch size: 230, lr: 4.28e-03, grad_scale: 8.0 2023-06-24 23:06:58,449 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1203102.0, ans=0.125 2023-06-24 23:07:39,243 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1203222.0, ans=0.125 2023-06-24 23:08:32,338 INFO [train.py:996] (3/4) Epoch 7, batch 17600, loss[loss=0.2395, simple_loss=0.32, pruned_loss=0.07945, over 21564.00 frames. ], tot_loss[loss=0.2259, simple_loss=0.3064, pruned_loss=0.07275, over 4258121.78 frames. ], batch size: 389, lr: 4.28e-03, grad_scale: 16.0 2023-06-24 23:08:56,605 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1203462.0, ans=0.1 2023-06-24 23:09:04,015 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=10.88 vs. limit=15.0 2023-06-24 23:09:04,776 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=1203462.0, ans=0.0 2023-06-24 23:09:13,850 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1203522.0, ans=0.125 2023-06-24 23:09:16,084 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=5.96 vs. limit=15.0 2023-06-24 23:09:18,533 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=1203522.0, ans=0.035 2023-06-24 23:09:41,267 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.027e+02 2.778e+02 3.294e+02 4.134e+02 8.304e+02, threshold=6.589e+02, percent-clipped=2.0 2023-06-24 23:10:20,631 INFO [train.py:996] (3/4) Epoch 7, batch 17650, loss[loss=0.1673, simple_loss=0.2287, pruned_loss=0.05293, over 21689.00 frames. ], tot_loss[loss=0.2253, simple_loss=0.3043, pruned_loss=0.07315, over 4263670.38 frames. 
], batch size: 112, lr: 4.28e-03, grad_scale: 16.0 2023-06-24 23:10:31,780 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-24 23:11:06,129 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=1203822.0, ans=0.2 2023-06-24 23:11:26,027 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1203882.0, ans=0.125 2023-06-24 23:12:10,577 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=1204002.0, ans=0.125 2023-06-24 23:12:12,062 INFO [train.py:996] (3/4) Epoch 7, batch 17700, loss[loss=0.264, simple_loss=0.3453, pruned_loss=0.09132, over 21915.00 frames. ], tot_loss[loss=0.2192, simple_loss=0.2974, pruned_loss=0.07053, over 4257161.21 frames. ], batch size: 372, lr: 4.28e-03, grad_scale: 16.0 2023-06-24 23:12:46,193 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=1204062.0, ans=0.5 2023-06-24 23:13:07,580 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1204122.0, ans=0.125 2023-06-24 23:13:30,878 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.242e+02 2.965e+02 3.854e+02 5.323e+02 9.978e+02, threshold=7.709e+02, percent-clipped=16.0 2023-06-24 23:13:41,734 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=2.93 vs. limit=15.0 2023-06-24 23:13:44,962 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1204242.0, ans=0.125 2023-06-24 23:14:05,492 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=1204302.0, ans=0.125 2023-06-24 23:14:06,514 INFO [train.py:996] (3/4) Epoch 7, batch 17750, loss[loss=0.2469, simple_loss=0.3233, pruned_loss=0.08521, over 21293.00 frames. ], tot_loss[loss=0.226, simple_loss=0.3044, pruned_loss=0.07374, over 4262034.15 frames. ], batch size: 159, lr: 4.28e-03, grad_scale: 16.0 2023-06-24 23:14:08,875 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=1204302.0, ans=0.07 2023-06-24 23:14:26,834 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=2.93 vs. limit=12.0 2023-06-24 23:14:35,453 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1204362.0, ans=0.125 2023-06-24 23:15:24,598 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-24 23:15:39,648 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=1204542.0, ans=0.2 2023-06-24 23:15:52,576 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=8.31 vs. limit=15.0 2023-06-24 23:15:56,636 INFO [train.py:996] (3/4) Epoch 7, batch 17800, loss[loss=0.2468, simple_loss=0.3374, pruned_loss=0.07806, over 21286.00 frames. ], tot_loss[loss=0.2246, simple_loss=0.304, pruned_loss=0.07257, over 4264554.35 frames. 
], batch size: 549, lr: 4.28e-03, grad_scale: 16.0 2023-06-24 23:16:40,563 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1204722.0, ans=0.0 2023-06-24 23:17:20,129 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1204782.0, ans=0.125 2023-06-24 23:17:23,011 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.078e+02 2.835e+02 3.431e+02 4.472e+02 1.183e+03, threshold=6.863e+02, percent-clipped=3.0 2023-06-24 23:17:35,721 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1204842.0, ans=0.125 2023-06-24 23:17:41,459 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1204842.0, ans=0.125 2023-06-24 23:17:47,977 INFO [train.py:996] (3/4) Epoch 7, batch 17850, loss[loss=0.2458, simple_loss=0.318, pruned_loss=0.08683, over 21715.00 frames. ], tot_loss[loss=0.2261, simple_loss=0.3054, pruned_loss=0.07345, over 4266201.80 frames. ], batch size: 298, lr: 4.28e-03, grad_scale: 16.0 2023-06-24 23:17:58,105 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.31 vs. limit=6.0 2023-06-24 23:18:41,596 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=1205022.0, ans=0.0 2023-06-24 23:19:15,500 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=1205082.0, ans=0.2 2023-06-24 23:19:18,905 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1205142.0, ans=0.125 2023-06-24 23:19:20,885 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=1205142.0, ans=0.2 2023-06-24 23:19:38,021 INFO [train.py:996] (3/4) Epoch 7, batch 17900, loss[loss=0.2145, simple_loss=0.3117, pruned_loss=0.05862, over 21659.00 frames. ], tot_loss[loss=0.2307, simple_loss=0.3107, pruned_loss=0.07541, over 4275281.13 frames. 
], batch size: 263, lr: 4.28e-03, grad_scale: 16.0 2023-06-24 23:19:50,930 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1205202.0, ans=0.1 2023-06-24 23:20:13,635 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=1205262.0, ans=0.2 2023-06-24 23:20:26,119 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=1205262.0, ans=0.0 2023-06-24 23:20:59,020 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1205382.0, ans=0.125 2023-06-24 23:21:03,391 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.271e+02 2.987e+02 3.415e+02 4.264e+02 7.391e+02, threshold=6.831e+02, percent-clipped=3.0 2023-06-24 23:21:16,339 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1205442.0, ans=0.0 2023-06-24 23:21:18,224 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=1205442.0, ans=10.0 2023-06-24 23:21:23,377 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=1205442.0, ans=0.2 2023-06-24 23:21:25,766 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.86 vs. limit=15.0 2023-06-24 23:21:27,874 INFO [train.py:996] (3/4) Epoch 7, batch 17950, loss[loss=0.2334, simple_loss=0.3217, pruned_loss=0.07259, over 21471.00 frames. ], tot_loss[loss=0.2264, simple_loss=0.3088, pruned_loss=0.07194, over 4278459.95 frames. ], batch size: 471, lr: 4.28e-03, grad_scale: 16.0 2023-06-24 23:22:11,620 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-24 23:22:18,599 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-24 23:22:55,236 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1205682.0, ans=0.125 2023-06-24 23:23:05,471 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=1205742.0, ans=0.07 2023-06-24 23:23:14,406 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=8.12 vs. limit=15.0 2023-06-24 23:23:19,951 INFO [train.py:996] (3/4) Epoch 7, batch 18000, loss[loss=0.2119, simple_loss=0.2788, pruned_loss=0.07248, over 21758.00 frames. ], tot_loss[loss=0.2218, simple_loss=0.3028, pruned_loss=0.07038, over 4277724.34 frames. ], batch size: 371, lr: 4.28e-03, grad_scale: 32.0 2023-06-24 23:23:19,952 INFO [train.py:1019] (3/4) Computing validation loss 2023-06-24 23:23:40,283 INFO [train.py:1028] (3/4) Epoch 7, validation: loss=0.2616, simple_loss=0.3599, pruned_loss=0.08162, over 1796401.00 frames. 
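The recurring `[optim.py:471]` lines in this log report, for recent batches, the minimum, 25%, median, 75% and maximum gradient norm, followed by a clipping threshold and the percentage of batches that were clipped. In the entries here the threshold sits very close to `Clipping_scale` (2.0) times the reported median (for example 2.0 × 3.415e+02 ≈ 6.83e+02 against the "threshold=6.831e+02" entry just above), which suggests a moving-median rule. The sketch below shows one minimal way such statistics could be maintained; it assumes a simple sliding window, the class and parameter names (`GradNormTracker`, `window`) are hypothetical, and it is not the icefall/k2 implementation.

```python
from collections import deque
import statistics


class GradNormTracker:
    """Tracks recent per-batch gradient norms and derives a clipping threshold.

    Illustrative only: threshold = clipping_scale * median of the norms in a
    sliding window, mirroring the relation visible in the
    "grad-norm quartiles ... threshold=..." log entries.
    """

    def __init__(self, clipping_scale: float = 2.0, window: int = 400):
        self.clipping_scale = clipping_scale
        self.norms = deque(maxlen=window)   # most recent gradient norms

    def update(self, grad_norm: float) -> float:
        """Record one batch's gradient norm and return the current threshold."""
        self.norms.append(grad_norm)
        return self.clipping_scale * statistics.median(self.norms)

    def quartiles(self):
        """Return (min, ~25%, median, ~75%, max) over the recorded norms."""
        xs = sorted(self.norms)
        q25, q50, q75 = statistics.quantiles(xs, n=4)
        return xs[0], q25, q50, q75, xs[-1]


# Feeding in norms shaped like the quartiles reported just above
# (2.271e+02 ... 7.391e+02) gives a threshold of 2.0 * 341.5 = 683.0,
# in line with the logged "threshold=6.831e+02".
tracker = GradNormTracker(clipping_scale=2.0)
for g in (227.1, 298.7, 341.5, 426.4, 739.1):
    threshold = tracker.update(g)
print(tracker.quartiles(), threshold)
```

Under this reading, the logged `percent-clipped` figure would simply be the share of recent batches whose gradient norm exceeded the threshold in effect when they were processed.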
2023-06-24 23:23:40,283 INFO [train.py:1029] (3/4) Maximum memory allocated so far is 23409MB 2023-06-24 23:24:27,555 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1205922.0, ans=0.125 2023-06-24 23:24:29,820 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=5.89 vs. limit=15.0 2023-06-24 23:24:52,003 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=1205982.0, ans=0.125 2023-06-24 23:24:55,040 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.910e+02 2.947e+02 3.493e+02 4.464e+02 9.866e+02, threshold=6.986e+02, percent-clipped=5.0 2023-06-24 23:25:35,631 INFO [train.py:996] (3/4) Epoch 7, batch 18050, loss[loss=0.2203, simple_loss=0.2981, pruned_loss=0.0712, over 21375.00 frames. ], tot_loss[loss=0.2184, simple_loss=0.2968, pruned_loss=0.07004, over 4278954.67 frames. ], batch size: 211, lr: 4.28e-03, grad_scale: 32.0 2023-06-24 23:25:41,860 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.30 vs. limit=10.0 2023-06-24 23:25:53,716 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1206102.0, ans=0.125 2023-06-24 23:27:32,132 INFO [train.py:996] (3/4) Epoch 7, batch 18100, loss[loss=0.2217, simple_loss=0.2991, pruned_loss=0.0722, over 19984.00 frames. ], tot_loss[loss=0.2234, simple_loss=0.3019, pruned_loss=0.07247, over 4272435.96 frames. ], batch size: 703, lr: 4.28e-03, grad_scale: 32.0 2023-06-24 23:27:42,920 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=1206402.0, ans=0.2 2023-06-24 23:28:07,550 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1206522.0, ans=0.125 2023-06-24 23:28:26,667 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1206582.0, ans=0.125 2023-06-24 23:28:44,810 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.264e+02 2.881e+02 3.345e+02 4.009e+02 7.924e+02, threshold=6.690e+02, percent-clipped=2.0 2023-06-24 23:29:11,653 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=1206642.0, ans=0.2 2023-06-24 23:29:14,113 INFO [train.py:996] (3/4) Epoch 7, batch 18150, loss[loss=0.2097, simple_loss=0.2823, pruned_loss=0.06849, over 21811.00 frames. ], tot_loss[loss=0.2243, simple_loss=0.3037, pruned_loss=0.07247, over 4266071.87 frames. 
], batch size: 317, lr: 4.28e-03, grad_scale: 32.0 2023-06-24 23:29:26,997 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=1206702.0, ans=0.0 2023-06-24 23:29:52,350 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1206822.0, ans=0.0 2023-06-24 23:30:31,650 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1206882.0, ans=0.125 2023-06-24 23:30:32,111 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.57 vs. limit=15.0 2023-06-24 23:30:53,830 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1206942.0, ans=0.125 2023-06-24 23:30:59,303 INFO [train.py:996] (3/4) Epoch 7, batch 18200, loss[loss=0.1989, simple_loss=0.2687, pruned_loss=0.06453, over 21891.00 frames. ], tot_loss[loss=0.2212, simple_loss=0.2975, pruned_loss=0.07244, over 4260199.11 frames. ], batch size: 98, lr: 4.28e-03, grad_scale: 32.0 2023-06-24 23:31:21,990 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1207062.0, ans=0.0 2023-06-24 23:31:26,844 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=1207062.0, ans=0.2 2023-06-24 23:32:05,004 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.956e+02 2.874e+02 3.635e+02 5.188e+02 1.150e+03, threshold=7.270e+02, percent-clipped=9.0 2023-06-24 23:32:38,605 INFO [train.py:996] (3/4) Epoch 7, batch 18250, loss[loss=0.2134, simple_loss=0.2878, pruned_loss=0.06952, over 21753.00 frames. ], tot_loss[loss=0.2147, simple_loss=0.2897, pruned_loss=0.06988, over 4264079.22 frames. ], batch size: 112, lr: 4.28e-03, grad_scale: 32.0 2023-06-24 23:32:40,868 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1207302.0, ans=0.125 2023-06-24 23:32:42,553 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.max_positive, batch_count=1207302.0, ans=0.95 2023-06-24 23:32:42,675 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1207302.0, ans=0.125 2023-06-24 23:33:24,714 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1207422.0, ans=0.0 2023-06-24 23:33:38,786 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=1207482.0, ans=0.0 2023-06-24 23:33:53,266 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.94 vs. limit=6.0 2023-06-24 23:33:56,496 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.12 vs. limit=6.0 2023-06-24 23:34:24,270 INFO [train.py:996] (3/4) Epoch 7, batch 18300, loss[loss=0.2333, simple_loss=0.3297, pruned_loss=0.06844, over 21738.00 frames. ], tot_loss[loss=0.2161, simple_loss=0.2909, pruned_loss=0.07059, over 4266738.12 frames. 
], batch size: 247, lr: 4.28e-03, grad_scale: 16.0 2023-06-24 23:34:24,698 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1207602.0, ans=0.125 2023-06-24 23:34:28,332 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-24 23:34:46,131 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=1207662.0, ans=10.0 2023-06-24 23:34:53,349 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1207662.0, ans=0.0 2023-06-24 23:34:58,961 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1207662.0, ans=0.125 2023-06-24 23:35:06,054 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1207662.0, ans=0.125 2023-06-24 23:35:26,591 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1207782.0, ans=0.1 2023-06-24 23:35:39,690 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.046e+02 2.910e+02 3.541e+02 4.206e+02 1.059e+03, threshold=7.082e+02, percent-clipped=3.0 2023-06-24 23:36:12,376 INFO [train.py:996] (3/4) Epoch 7, batch 18350, loss[loss=0.2163, simple_loss=0.3209, pruned_loss=0.05591, over 21593.00 frames. ], tot_loss[loss=0.2181, simple_loss=0.2965, pruned_loss=0.06986, over 4264690.17 frames. ], batch size: 230, lr: 4.28e-03, grad_scale: 16.0 2023-06-24 23:36:22,371 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=5.18 vs. limit=12.0 2023-06-24 23:36:39,526 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1207962.0, ans=0.125 2023-06-24 23:37:25,159 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=1208082.0, ans=0.125 2023-06-24 23:37:27,316 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1208082.0, ans=0.125 2023-06-24 23:38:01,015 INFO [train.py:996] (3/4) Epoch 7, batch 18400, loss[loss=0.1849, simple_loss=0.2704, pruned_loss=0.0497, over 21782.00 frames. ], tot_loss[loss=0.2136, simple_loss=0.2906, pruned_loss=0.06831, over 4250447.28 frames. ], batch size: 352, lr: 4.28e-03, grad_scale: 32.0 2023-06-24 23:39:16,997 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.081e+02 2.559e+02 3.009e+02 3.655e+02 5.951e+02, threshold=6.019e+02, percent-clipped=0.0 2023-06-24 23:39:19,397 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-24 23:39:29,517 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1208442.0, ans=0.0 2023-06-24 23:39:49,310 INFO [train.py:996] (3/4) Epoch 7, batch 18450, loss[loss=0.1867, simple_loss=0.2747, pruned_loss=0.04933, over 21274.00 frames. ], tot_loss[loss=0.2086, simple_loss=0.2874, pruned_loss=0.0649, over 4240841.31 frames. 
], batch size: 551, lr: 4.27e-03, grad_scale: 32.0 2023-06-24 23:39:50,400 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=8.34 vs. limit=15.0 2023-06-24 23:40:24,925 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1208562.0, ans=0.125 2023-06-24 23:40:29,921 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1208562.0, ans=0.1 2023-06-24 23:40:35,248 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=1208622.0, ans=0.2 2023-06-24 23:40:40,277 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1208622.0, ans=0.1 2023-06-24 23:40:42,811 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.18 vs. limit=10.0 2023-06-24 23:41:38,242 INFO [train.py:996] (3/4) Epoch 7, batch 18500, loss[loss=0.1909, simple_loss=0.2827, pruned_loss=0.04952, over 21712.00 frames. ], tot_loss[loss=0.206, simple_loss=0.2834, pruned_loss=0.06432, over 4246984.98 frames. ], batch size: 332, lr: 4.27e-03, grad_scale: 32.0 2023-06-24 23:41:38,756 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1208802.0, ans=0.0 2023-06-24 23:41:59,532 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1208862.0, ans=0.1 2023-06-24 23:42:01,495 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=1208862.0, ans=0.0 2023-06-24 23:42:03,607 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.79 vs. limit=10.0 2023-06-24 23:42:25,332 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.76 vs. limit=22.5 2023-06-24 23:42:38,849 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=12.27 vs. limit=15.0 2023-06-24 23:42:53,739 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.69 vs. limit=12.0 2023-06-24 23:42:59,433 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.887e+02 2.868e+02 3.588e+02 5.410e+02 1.340e+03, threshold=7.175e+02, percent-clipped=18.0 2023-06-24 23:43:25,441 INFO [train.py:996] (3/4) Epoch 7, batch 18550, loss[loss=0.1969, simple_loss=0.2513, pruned_loss=0.07126, over 17038.00 frames. ], tot_loss[loss=0.2044, simple_loss=0.2811, pruned_loss=0.06381, over 4247941.35 frames. 
], batch size: 67, lr: 4.27e-03, grad_scale: 16.0 2023-06-24 23:44:23,937 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1209222.0, ans=0.0 2023-06-24 23:44:29,440 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=1209282.0, ans=0.0 2023-06-24 23:44:50,521 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=1209342.0, ans=10.0 2023-06-24 23:45:13,359 INFO [train.py:996] (3/4) Epoch 7, batch 18600, loss[loss=0.2012, simple_loss=0.2781, pruned_loss=0.06213, over 21698.00 frames. ], tot_loss[loss=0.2041, simple_loss=0.2789, pruned_loss=0.06468, over 4223753.60 frames. ], batch size: 333, lr: 4.27e-03, grad_scale: 16.0 2023-06-24 23:45:26,895 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1209402.0, ans=0.1 2023-06-24 23:45:49,123 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.min_positive, batch_count=1209462.0, ans=0.025 2023-06-24 23:46:35,139 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.056e+02 2.703e+02 3.435e+02 4.233e+02 7.811e+02, threshold=6.869e+02, percent-clipped=3.0 2023-06-24 23:47:01,133 INFO [train.py:996] (3/4) Epoch 7, batch 18650, loss[loss=0.2526, simple_loss=0.3127, pruned_loss=0.09629, over 21430.00 frames. ], tot_loss[loss=0.2042, simple_loss=0.2784, pruned_loss=0.06496, over 4207430.13 frames. ], batch size: 473, lr: 4.27e-03, grad_scale: 16.0 2023-06-24 23:47:06,906 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1209702.0, ans=0.0 2023-06-24 23:47:22,734 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1209762.0, ans=0.125 2023-06-24 23:47:45,669 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1209822.0, ans=0.125 2023-06-24 23:47:54,328 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=1209822.0, ans=0.2 2023-06-24 23:48:14,146 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.92 vs. limit=12.0 2023-06-24 23:48:32,188 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1209942.0, ans=0.125 2023-06-24 23:48:48,669 INFO [train.py:996] (3/4) Epoch 7, batch 18700, loss[loss=0.1994, simple_loss=0.2701, pruned_loss=0.0643, over 21823.00 frames. ], tot_loss[loss=0.2041, simple_loss=0.2764, pruned_loss=0.06588, over 4216207.14 frames. ], batch size: 282, lr: 4.27e-03, grad_scale: 16.0 2023-06-24 23:48:51,616 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1.whitening_limit, batch_count=1210002.0, ans=10.0 2023-06-24 23:50:10,825 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.160e+02 2.803e+02 3.350e+02 3.905e+02 5.845e+02, threshold=6.700e+02, percent-clipped=0.0 2023-06-24 23:50:36,536 INFO [train.py:996] (3/4) Epoch 7, batch 18750, loss[loss=0.2571, simple_loss=0.3387, pruned_loss=0.08778, over 21382.00 frames. 
], tot_loss[loss=0.2081, simple_loss=0.2786, pruned_loss=0.06883, over 4230246.52 frames. ], batch size: 131, lr: 4.27e-03, grad_scale: 16.0 2023-06-24 23:51:44,689 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=1210482.0, ans=0.2 2023-06-24 23:52:01,731 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1210542.0, ans=0.1 2023-06-24 23:52:22,876 INFO [train.py:996] (3/4) Epoch 7, batch 18800, loss[loss=0.1991, simple_loss=0.2893, pruned_loss=0.05447, over 21757.00 frames. ], tot_loss[loss=0.2124, simple_loss=0.2851, pruned_loss=0.06986, over 4242751.94 frames. ], batch size: 351, lr: 4.27e-03, grad_scale: 32.0 2023-06-24 23:53:10,220 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=1210722.0, ans=0.125 2023-06-24 23:53:32,677 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1210782.0, ans=0.125 2023-06-24 23:53:43,557 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.736e+02 2.643e+02 3.373e+02 4.457e+02 8.790e+02, threshold=6.746e+02, percent-clipped=4.0 2023-06-24 23:54:09,224 INFO [train.py:996] (3/4) Epoch 7, batch 18850, loss[loss=0.1827, simple_loss=0.2826, pruned_loss=0.04136, over 21600.00 frames. ], tot_loss[loss=0.2069, simple_loss=0.2832, pruned_loss=0.06532, over 4255798.23 frames. ], batch size: 389, lr: 4.27e-03, grad_scale: 32.0 2023-06-24 23:54:42,674 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=10.58 vs. limit=15.0 2023-06-24 23:54:59,699 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1211022.0, ans=0.0 2023-06-24 23:55:09,246 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.52 vs. limit=22.5 2023-06-24 23:55:10,091 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1211082.0, ans=0.125 2023-06-24 23:55:56,272 INFO [train.py:996] (3/4) Epoch 7, batch 18900, loss[loss=0.1643, simple_loss=0.2406, pruned_loss=0.04398, over 21281.00 frames. ], tot_loss[loss=0.206, simple_loss=0.2815, pruned_loss=0.06523, over 4243507.77 frames. ], batch size: 176, lr: 4.27e-03, grad_scale: 32.0 2023-06-24 23:56:50,437 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=1211322.0, ans=0.0 2023-06-24 23:56:52,857 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.51 vs. 
limit=15.0 2023-06-24 23:56:53,843 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=1211322.0, ans=0.2 2023-06-24 23:57:06,113 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=1211382.0, ans=0.0 2023-06-24 23:57:09,653 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-24 23:57:09,673 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-24 23:57:17,722 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.856e+02 2.759e+02 3.206e+02 4.379e+02 8.069e+02, threshold=6.411e+02, percent-clipped=2.0 2023-06-24 23:57:31,086 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.59 vs. limit=22.5 2023-06-24 23:57:39,287 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1211442.0, ans=0.125 2023-06-24 23:57:44,051 INFO [train.py:996] (3/4) Epoch 7, batch 18950, loss[loss=0.225, simple_loss=0.3002, pruned_loss=0.07488, over 21662.00 frames. ], tot_loss[loss=0.2075, simple_loss=0.2808, pruned_loss=0.06716, over 4248601.93 frames. ], batch size: 263, lr: 4.27e-03, grad_scale: 16.0 2023-06-24 23:58:53,790 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1211682.0, ans=0.125 2023-06-24 23:59:25,827 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=1211742.0, ans=0.2 2023-06-24 23:59:39,027 INFO [train.py:996] (3/4) Epoch 7, batch 19000, loss[loss=0.2434, simple_loss=0.3223, pruned_loss=0.08224, over 21444.00 frames. ], tot_loss[loss=0.213, simple_loss=0.289, pruned_loss=0.06854, over 4255425.01 frames. ], batch size: 211, lr: 4.27e-03, grad_scale: 16.0 2023-06-25 00:01:02,150 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.437e+02 3.130e+02 3.898e+02 4.619e+02 8.945e+02, threshold=7.797e+02, percent-clipped=5.0 2023-06-25 00:01:26,861 INFO [train.py:996] (3/4) Epoch 7, batch 19050, loss[loss=0.2484, simple_loss=0.3132, pruned_loss=0.09176, over 21841.00 frames. ], tot_loss[loss=0.2202, simple_loss=0.2951, pruned_loss=0.07264, over 4259578.97 frames. ], batch size: 351, lr: 4.27e-03, grad_scale: 16.0 2023-06-25 00:01:45,772 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1212162.0, ans=0.1 2023-06-25 00:02:10,344 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1212222.0, ans=0.125 2023-06-25 00:03:13,219 INFO [train.py:996] (3/4) Epoch 7, batch 19100, loss[loss=0.1951, simple_loss=0.2631, pruned_loss=0.06356, over 21864.00 frames. ], tot_loss[loss=0.2197, simple_loss=0.293, pruned_loss=0.07323, over 4270870.19 frames. 
], batch size: 98, lr: 4.27e-03, grad_scale: 16.0 2023-06-25 00:03:51,882 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1212462.0, ans=0.125 2023-06-25 00:03:53,405 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1212522.0, ans=0.0 2023-06-25 00:04:38,989 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.215e+02 2.814e+02 3.416e+02 4.391e+02 9.529e+02, threshold=6.832e+02, percent-clipped=4.0 2023-06-25 00:04:39,816 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=1212642.0, ans=0.0 2023-06-25 00:04:49,177 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=1212642.0, ans=0.2 2023-06-25 00:05:04,646 INFO [train.py:996] (3/4) Epoch 7, batch 19150, loss[loss=0.2369, simple_loss=0.3338, pruned_loss=0.07001, over 21700.00 frames. ], tot_loss[loss=0.2234, simple_loss=0.2974, pruned_loss=0.07465, over 4271993.86 frames. ], batch size: 247, lr: 4.27e-03, grad_scale: 16.0 2023-06-25 00:07:00,454 INFO [train.py:996] (3/4) Epoch 7, batch 19200, loss[loss=0.2451, simple_loss=0.3426, pruned_loss=0.07378, over 21802.00 frames. ], tot_loss[loss=0.2275, simple_loss=0.3053, pruned_loss=0.07483, over 4265848.16 frames. ], batch size: 332, lr: 4.27e-03, grad_scale: 32.0 2023-06-25 00:07:25,791 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1213062.0, ans=0.125 2023-06-25 00:07:57,382 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.21 vs. limit=6.0 2023-06-25 00:08:14,504 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.61 vs. limit=12.0 2023-06-25 00:08:23,260 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.49 vs. limit=15.0 2023-06-25 00:08:23,635 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.016e+02 3.201e+02 4.532e+02 8.099e+02 1.362e+03, threshold=9.063e+02, percent-clipped=31.0 2023-06-25 00:08:48,663 INFO [train.py:996] (3/4) Epoch 7, batch 19250, loss[loss=0.2211, simple_loss=0.3018, pruned_loss=0.07018, over 21607.00 frames. ], tot_loss[loss=0.2248, simple_loss=0.3075, pruned_loss=0.0711, over 4265316.93 frames. ], batch size: 471, lr: 4.27e-03, grad_scale: 32.0 2023-06-25 00:10:08,218 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.82 vs. limit=10.0 2023-06-25 00:10:29,816 INFO [train.py:996] (3/4) Epoch 7, batch 19300, loss[loss=0.2031, simple_loss=0.2865, pruned_loss=0.05985, over 21679.00 frames. ], tot_loss[loss=0.2227, simple_loss=0.304, pruned_loss=0.07065, over 4277902.56 frames. 
], batch size: 389, lr: 4.27e-03, grad_scale: 16.0 2023-06-25 00:11:55,158 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1213782.0, ans=0.1 2023-06-25 00:12:02,041 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.816e+02 2.613e+02 3.067e+02 3.986e+02 9.865e+02, threshold=6.134e+02, percent-clipped=1.0 2023-06-25 00:12:24,999 INFO [train.py:996] (3/4) Epoch 7, batch 19350, loss[loss=0.207, simple_loss=0.298, pruned_loss=0.05803, over 21560.00 frames. ], tot_loss[loss=0.2179, simple_loss=0.2993, pruned_loss=0.06825, over 4279502.20 frames. ], batch size: 441, lr: 4.27e-03, grad_scale: 16.0 2023-06-25 00:12:44,727 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1213902.0, ans=0.0 2023-06-25 00:13:10,131 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1214022.0, ans=0.125 2023-06-25 00:13:16,675 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1214022.0, ans=0.125 2023-06-25 00:13:24,843 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1214082.0, ans=0.125 2023-06-25 00:13:24,948 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1214082.0, ans=0.1 2023-06-25 00:13:36,639 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=1214082.0, ans=0.035 2023-06-25 00:14:11,266 INFO [train.py:996] (3/4) Epoch 7, batch 19400, loss[loss=0.1895, simple_loss=0.2849, pruned_loss=0.04702, over 19877.00 frames. ], tot_loss[loss=0.2156, simple_loss=0.2969, pruned_loss=0.06714, over 4284230.12 frames. ], batch size: 703, lr: 4.26e-03, grad_scale: 16.0 2023-06-25 00:14:51,169 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=1214262.0, ans=0.0 2023-06-25 00:14:54,834 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1214322.0, ans=0.1 2023-06-25 00:14:58,128 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1214322.0, ans=0.1 2023-06-25 00:15:28,969 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=1214382.0, ans=0.0 2023-06-25 00:15:34,735 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.117e+02 2.871e+02 3.427e+02 4.239e+02 8.208e+02, threshold=6.853e+02, percent-clipped=6.0 2023-06-25 00:15:58,297 INFO [train.py:996] (3/4) Epoch 7, batch 19450, loss[loss=0.2425, simple_loss=0.3022, pruned_loss=0.09139, over 21859.00 frames. ], tot_loss[loss=0.2165, simple_loss=0.2945, pruned_loss=0.06924, over 4290359.14 frames. 
], batch size: 98, lr: 4.26e-03, grad_scale: 16.0 2023-06-25 00:16:17,867 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=1214502.0, ans=0.05 2023-06-25 00:16:34,873 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-25 00:17:22,262 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=1214742.0, ans=0.125 2023-06-25 00:17:46,977 INFO [train.py:996] (3/4) Epoch 7, batch 19500, loss[loss=0.2017, simple_loss=0.276, pruned_loss=0.06377, over 21641.00 frames. ], tot_loss[loss=0.2149, simple_loss=0.2903, pruned_loss=0.06978, over 4283665.26 frames. ], batch size: 263, lr: 4.26e-03, grad_scale: 16.0 2023-06-25 00:18:23,919 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten.whitening_limit, batch_count=1214862.0, ans=15.0 2023-06-25 00:18:27,034 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1214862.0, ans=0.0 2023-06-25 00:18:50,297 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=1214922.0, ans=0.2 2023-06-25 00:19:04,635 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1214982.0, ans=0.125 2023-06-25 00:19:14,505 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.308e+02 2.919e+02 3.343e+02 4.176e+02 7.589e+02, threshold=6.686e+02, percent-clipped=2.0 2023-06-25 00:19:30,460 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1215042.0, ans=0.125 2023-06-25 00:19:33,981 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=1215042.0, ans=0.125 2023-06-25 00:19:36,599 INFO [train.py:996] (3/4) Epoch 7, batch 19550, loss[loss=0.1635, simple_loss=0.2209, pruned_loss=0.05303, over 21901.00 frames. ], tot_loss[loss=0.2098, simple_loss=0.2841, pruned_loss=0.06774, over 4268920.28 frames. 
], batch size: 107, lr: 4.26e-03, grad_scale: 16.0 2023-06-25 00:19:37,369 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1215102.0, ans=0.125 2023-06-25 00:19:37,382 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1215102.0, ans=0.0 2023-06-25 00:19:43,759 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1215102.0, ans=0.125 2023-06-25 00:20:04,836 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=1215162.0, ans=0.2 2023-06-25 00:20:04,949 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=1215162.0, ans=0.125 2023-06-25 00:21:07,360 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=1215342.0, ans=0.0 2023-06-25 00:21:22,301 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1215342.0, ans=0.125 2023-06-25 00:21:26,525 INFO [train.py:996] (3/4) Epoch 7, batch 19600, loss[loss=0.205, simple_loss=0.2789, pruned_loss=0.06557, over 21812.00 frames. ], tot_loss[loss=0.211, simple_loss=0.2855, pruned_loss=0.06824, over 4276299.47 frames. ], batch size: 247, lr: 4.26e-03, grad_scale: 32.0 2023-06-25 00:22:06,454 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-25 00:22:33,503 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.26 vs. limit=15.0 2023-06-25 00:22:52,375 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.425e+02 3.092e+02 3.648e+02 4.642e+02 7.608e+02, threshold=7.295e+02, percent-clipped=3.0 2023-06-25 00:23:06,159 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1215642.0, ans=0.125 2023-06-25 00:23:07,886 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=1215642.0, ans=0.0 2023-06-25 00:23:19,162 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=10.51 vs. limit=15.0 2023-06-25 00:23:21,450 INFO [train.py:996] (3/4) Epoch 7, batch 19650, loss[loss=0.2921, simple_loss=0.3408, pruned_loss=0.1217, over 21628.00 frames. ], tot_loss[loss=0.2183, simple_loss=0.2915, pruned_loss=0.07252, over 4281345.86 frames. ], batch size: 510, lr: 4.26e-03, grad_scale: 16.0 2023-06-25 00:23:28,960 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1215702.0, ans=0.1 2023-06-25 00:23:33,072 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.53 vs. limit=22.5 2023-06-25 00:24:26,943 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1215882.0, ans=0.125 2023-06-25 00:25:19,691 INFO [train.py:996] (3/4) Epoch 7, batch 19700, loss[loss=0.2231, simple_loss=0.3233, pruned_loss=0.0614, over 21185.00 frames. 
], tot_loss[loss=0.2216, simple_loss=0.2961, pruned_loss=0.0735, over 4282867.36 frames. ], batch size: 548, lr: 4.26e-03, grad_scale: 16.0 2023-06-25 00:25:20,312 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=1216002.0, ans=0.125 2023-06-25 00:25:24,218 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=1216002.0, ans=0.2 2023-06-25 00:26:17,307 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=1216122.0, ans=0.04949747468305833 2023-06-25 00:26:41,314 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=1216182.0, ans=0.125 2023-06-25 00:26:53,998 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.233e+02 3.060e+02 3.533e+02 4.552e+02 9.773e+02, threshold=7.066e+02, percent-clipped=3.0 2023-06-25 00:27:03,583 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1216242.0, ans=0.125 2023-06-25 00:27:15,069 INFO [train.py:996] (3/4) Epoch 7, batch 19750, loss[loss=0.2901, simple_loss=0.3914, pruned_loss=0.09443, over 21868.00 frames. ], tot_loss[loss=0.2275, simple_loss=0.3052, pruned_loss=0.07486, over 4278811.89 frames. ], batch size: 372, lr: 4.26e-03, grad_scale: 16.0 2023-06-25 00:29:02,167 INFO [train.py:996] (3/4) Epoch 7, batch 19800, loss[loss=0.186, simple_loss=0.2544, pruned_loss=0.05877, over 21648.00 frames. ], tot_loss[loss=0.2285, simple_loss=0.306, pruned_loss=0.07548, over 4271943.51 frames. ], batch size: 195, lr: 4.26e-03, grad_scale: 16.0 2023-06-25 00:29:52,814 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.max_abs, batch_count=1216722.0, ans=10.0 2023-06-25 00:30:30,873 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.982e+02 2.745e+02 3.353e+02 4.359e+02 1.129e+03, threshold=6.706e+02, percent-clipped=10.0 2023-06-25 00:30:52,374 INFO [train.py:996] (3/4) Epoch 7, batch 19850, loss[loss=0.1819, simple_loss=0.2663, pruned_loss=0.0487, over 21775.00 frames. ], tot_loss[loss=0.2198, simple_loss=0.2986, pruned_loss=0.07048, over 4268714.70 frames. ], batch size: 316, lr: 4.26e-03, grad_scale: 16.0 2023-06-25 00:31:08,468 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1216962.0, ans=0.1 2023-06-25 00:31:08,608 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=1216962.0, ans=0.04949747468305833 2023-06-25 00:31:12,119 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1216962.0, ans=0.125 2023-06-25 00:31:34,418 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=9.60 vs. limit=10.0 2023-06-25 00:31:37,714 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.33 vs. limit=6.0 2023-06-25 00:32:39,665 INFO [train.py:996] (3/4) Epoch 7, batch 19900, loss[loss=0.1864, simple_loss=0.2586, pruned_loss=0.05705, over 21337.00 frames. ], tot_loss[loss=0.2165, simple_loss=0.2977, pruned_loss=0.06763, over 4269483.24 frames. 
], batch size: 211, lr: 4.26e-03, grad_scale: 16.0 2023-06-25 00:33:31,180 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1217322.0, ans=0.1 2023-06-25 00:34:03,363 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.15 vs. limit=15.0 2023-06-25 00:34:12,736 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.059e+02 2.818e+02 3.439e+02 4.122e+02 9.461e+02, threshold=6.879e+02, percent-clipped=3.0 2023-06-25 00:34:16,780 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1217442.0, ans=0.125 2023-06-25 00:34:28,705 INFO [train.py:996] (3/4) Epoch 7, batch 19950, loss[loss=0.1879, simple_loss=0.3044, pruned_loss=0.03571, over 19782.00 frames. ], tot_loss[loss=0.2139, simple_loss=0.2932, pruned_loss=0.06734, over 4261971.81 frames. ], batch size: 702, lr: 4.26e-03, grad_scale: 16.0 2023-06-25 00:34:31,552 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.02 vs. limit=10.0 2023-06-25 00:35:09,843 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.59 vs. limit=15.0 2023-06-25 00:35:41,799 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.91 vs. limit=22.5 2023-06-25 00:36:17,076 INFO [train.py:996] (3/4) Epoch 7, batch 20000, loss[loss=0.2123, simple_loss=0.2924, pruned_loss=0.06613, over 21497.00 frames. ], tot_loss[loss=0.2145, simple_loss=0.2937, pruned_loss=0.06764, over 4263775.75 frames. ], batch size: 195, lr: 4.26e-03, grad_scale: 32.0 2023-06-25 00:36:54,287 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1217862.0, ans=0.125 2023-06-25 00:37:47,459 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.230e+02 2.923e+02 3.292e+02 4.012e+02 7.608e+02, threshold=6.584e+02, percent-clipped=1.0 2023-06-25 00:38:03,216 INFO [train.py:996] (3/4) Epoch 7, batch 20050, loss[loss=0.2224, simple_loss=0.3029, pruned_loss=0.07092, over 21617.00 frames. ], tot_loss[loss=0.2183, simple_loss=0.2957, pruned_loss=0.07051, over 4275439.77 frames. ], batch size: 230, lr: 4.26e-03, grad_scale: 32.0 2023-06-25 00:38:30,331 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1218162.0, ans=0.125 2023-06-25 00:39:14,627 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1218282.0, ans=0.125 2023-06-25 00:39:45,214 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1218342.0, ans=0.0 2023-06-25 00:39:53,777 INFO [train.py:996] (3/4) Epoch 7, batch 20100, loss[loss=0.2275, simple_loss=0.2834, pruned_loss=0.08579, over 21600.00 frames. ], tot_loss[loss=0.2224, simple_loss=0.2988, pruned_loss=0.073, over 4274480.44 frames. 
], batch size: 548, lr: 4.26e-03, grad_scale: 16.0 2023-06-25 00:40:16,589 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1218462.0, ans=0.1 2023-06-25 00:41:29,509 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.560e+02 2.968e+02 3.649e+02 4.781e+02 8.701e+02, threshold=7.299e+02, percent-clipped=5.0 2023-06-25 00:41:49,523 INFO [train.py:996] (3/4) Epoch 7, batch 20150, loss[loss=0.2764, simple_loss=0.3408, pruned_loss=0.106, over 21773.00 frames. ], tot_loss[loss=0.2285, simple_loss=0.306, pruned_loss=0.07548, over 4271925.54 frames. ], batch size: 441, lr: 4.26e-03, grad_scale: 16.0 2023-06-25 00:41:54,151 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=1218702.0, ans=0.2 2023-06-25 00:42:59,247 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=1218882.0, ans=0.2 2023-06-25 00:43:17,294 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1218942.0, ans=0.1 2023-06-25 00:43:21,359 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.22 vs. limit=22.5 2023-06-25 00:43:51,462 INFO [train.py:996] (3/4) Epoch 7, batch 20200, loss[loss=0.2488, simple_loss=0.3778, pruned_loss=0.0599, over 19913.00 frames. ], tot_loss[loss=0.2336, simple_loss=0.3108, pruned_loss=0.07817, over 4261767.21 frames. ], batch size: 702, lr: 4.26e-03, grad_scale: 16.0 2023-06-25 00:44:09,042 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=8.32 vs. limit=15.0 2023-06-25 00:44:27,935 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1219062.0, ans=0.0 2023-06-25 00:44:58,850 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.16 vs. limit=10.0 2023-06-25 00:45:22,363 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.473e+02 3.331e+02 3.947e+02 5.099e+02 9.386e+02, threshold=7.894e+02, percent-clipped=7.0 2023-06-25 00:45:34,883 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=1219302.0, ans=0.0 2023-06-25 00:45:36,292 INFO [train.py:996] (3/4) Epoch 7, batch 20250, loss[loss=0.2267, simple_loss=0.2997, pruned_loss=0.07685, over 21794.00 frames. ], tot_loss[loss=0.231, simple_loss=0.3105, pruned_loss=0.07575, over 4254967.70 frames. ], batch size: 298, lr: 4.26e-03, grad_scale: 16.0 2023-06-25 00:46:06,734 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=9.21 vs. limit=15.0 2023-06-25 00:46:09,939 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=1219362.0, ans=0.2 2023-06-25 00:46:33,177 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1219482.0, ans=0.0 2023-06-25 00:46:35,556 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.24 vs. 
limit=15.0 2023-06-25 00:46:38,478 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1219482.0, ans=0.125 2023-06-25 00:47:06,383 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1219542.0, ans=0.125 2023-06-25 00:47:24,409 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.67 vs. limit=15.0 2023-06-25 00:47:25,049 INFO [train.py:996] (3/4) Epoch 7, batch 20300, loss[loss=0.2082, simple_loss=0.2972, pruned_loss=0.05963, over 21731.00 frames. ], tot_loss[loss=0.2263, simple_loss=0.3076, pruned_loss=0.07253, over 4255055.58 frames. ], batch size: 332, lr: 4.26e-03, grad_scale: 16.0 2023-06-25 00:47:51,885 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.82 vs. limit=10.0 2023-06-25 00:48:01,832 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=1219662.0, ans=0.0 2023-06-25 00:48:15,869 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1219722.0, ans=0.125 2023-06-25 00:48:52,965 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.137e+02 2.615e+02 3.044e+02 3.787e+02 8.411e+02, threshold=6.088e+02, percent-clipped=1.0 2023-06-25 00:49:02,168 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=1219842.0, ans=0.0 2023-06-25 00:49:11,886 INFO [train.py:996] (3/4) Epoch 7, batch 20350, loss[loss=0.2366, simple_loss=0.3004, pruned_loss=0.08644, over 21324.00 frames. ], tot_loss[loss=0.2272, simple_loss=0.308, pruned_loss=0.07319, over 4254620.91 frames. ], batch size: 143, lr: 4.25e-03, grad_scale: 16.0 2023-06-25 00:49:58,374 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1220022.0, ans=0.125 2023-06-25 00:50:56,352 INFO [train.py:996] (3/4) Epoch 7, batch 20400, loss[loss=0.2622, simple_loss=0.3387, pruned_loss=0.09281, over 21895.00 frames. ], tot_loss[loss=0.2318, simple_loss=0.3112, pruned_loss=0.07621, over 4253321.00 frames. ], batch size: 371, lr: 4.25e-03, grad_scale: 32.0 2023-06-25 00:51:02,527 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1220202.0, ans=0.0 2023-06-25 00:51:44,397 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=1220322.0, ans=0.2 2023-06-25 00:52:32,830 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.319e+02 3.347e+02 3.963e+02 4.819e+02 8.468e+02, threshold=7.927e+02, percent-clipped=6.0 2023-06-25 00:52:36,871 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1220442.0, ans=0.125 2023-06-25 00:52:41,796 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1220442.0, ans=0.125 2023-06-25 00:52:44,833 INFO [train.py:996] (3/4) Epoch 7, batch 20450, loss[loss=0.2572, simple_loss=0.3258, pruned_loss=0.09434, over 21511.00 frames. 
], tot_loss[loss=0.2345, simple_loss=0.3125, pruned_loss=0.07825, over 4254332.50 frames. ], batch size: 131, lr: 4.25e-03, grad_scale: 16.0 2023-06-25 00:52:50,271 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1220502.0, ans=0.125 2023-06-25 00:53:11,722 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer_na.min_abs, batch_count=1220562.0, ans=0.02 2023-06-25 00:53:41,166 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1220682.0, ans=0.1 2023-06-25 00:53:42,729 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=1220682.0, ans=0.04949747468305833 2023-06-25 00:53:54,575 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1220682.0, ans=0.125 2023-06-25 00:54:25,821 INFO [train.py:996] (3/4) Epoch 7, batch 20500, loss[loss=0.225, simple_loss=0.2873, pruned_loss=0.08141, over 21355.00 frames. ], tot_loss[loss=0.2316, simple_loss=0.3075, pruned_loss=0.07784, over 4258266.88 frames. ], batch size: 144, lr: 4.25e-03, grad_scale: 16.0 2023-06-25 00:54:52,657 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-25 00:55:29,748 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1220982.0, ans=0.1 2023-06-25 00:56:00,854 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.376e+02 3.204e+02 4.054e+02 5.426e+02 8.867e+02, threshold=8.109e+02, percent-clipped=2.0 2023-06-25 00:56:04,709 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-25 00:56:08,592 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1221042.0, ans=0.0 2023-06-25 00:56:13,071 INFO [train.py:996] (3/4) Epoch 7, batch 20550, loss[loss=0.194, simple_loss=0.305, pruned_loss=0.04153, over 19824.00 frames. ], tot_loss[loss=0.2259, simple_loss=0.3005, pruned_loss=0.07569, over 4261164.26 frames. ], batch size: 703, lr: 4.25e-03, grad_scale: 16.0 2023-06-25 00:56:33,268 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=1221102.0, ans=0.2 2023-06-25 00:57:08,477 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1221222.0, ans=0.125 2023-06-25 00:57:22,570 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1221282.0, ans=0.0 2023-06-25 00:57:41,282 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=1221342.0, ans=0.125 2023-06-25 00:57:46,899 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=1221342.0, ans=0.2 2023-06-25 00:57:56,518 INFO [train.py:996] (3/4) Epoch 7, batch 20600, loss[loss=0.2224, simple_loss=0.3013, pruned_loss=0.07176, over 21293.00 frames. ], tot_loss[loss=0.2254, simple_loss=0.3023, pruned_loss=0.07422, over 4239526.63 frames. 
], batch size: 176, lr: 4.25e-03, grad_scale: 16.0 2023-06-25 00:59:08,037 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=1221582.0, ans=0.125 2023-06-25 00:59:25,532 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.194e+02 3.095e+02 3.828e+02 5.103e+02 1.106e+03, threshold=7.656e+02, percent-clipped=7.0 2023-06-25 00:59:28,086 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1221642.0, ans=0.125 2023-06-25 00:59:37,755 INFO [train.py:996] (3/4) Epoch 7, batch 20650, loss[loss=0.2029, simple_loss=0.2742, pruned_loss=0.06578, over 21650.00 frames. ], tot_loss[loss=0.2243, simple_loss=0.2996, pruned_loss=0.07449, over 4248652.39 frames. ], batch size: 332, lr: 4.25e-03, grad_scale: 16.0 2023-06-25 01:00:07,952 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1221762.0, ans=0.0 2023-06-25 01:00:18,497 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1221822.0, ans=0.125 2023-06-25 01:00:36,820 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.82 vs. limit=15.0 2023-06-25 01:01:01,288 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.47 vs. limit=15.0 2023-06-25 01:01:15,088 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1221942.0, ans=0.125 2023-06-25 01:01:26,204 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1222002.0, ans=0.0 2023-06-25 01:01:32,865 INFO [train.py:996] (3/4) Epoch 7, batch 20700, loss[loss=0.2248, simple_loss=0.2911, pruned_loss=0.07923, over 20063.00 frames. ], tot_loss[loss=0.2194, simple_loss=0.2939, pruned_loss=0.07245, over 4245503.80 frames. ], batch size: 702, lr: 4.25e-03, grad_scale: 16.0 2023-06-25 01:01:33,505 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1222002.0, ans=0.0 2023-06-25 01:01:43,913 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1222002.0, ans=0.125 2023-06-25 01:01:48,832 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1222062.0, ans=0.125 2023-06-25 01:02:15,066 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.66 vs. 
limit=22.5 2023-06-25 01:02:27,192 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1222122.0, ans=0.0 2023-06-25 01:03:03,823 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=1222242.0, ans=0.0 2023-06-25 01:03:06,665 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.957e+02 2.936e+02 3.801e+02 5.565e+02 1.085e+03, threshold=7.602e+02, percent-clipped=14.0 2023-06-25 01:03:24,052 INFO [train.py:996] (3/4) Epoch 7, batch 20750, loss[loss=0.2725, simple_loss=0.3979, pruned_loss=0.07353, over 20855.00 frames. ], tot_loss[loss=0.2206, simple_loss=0.2963, pruned_loss=0.0725, over 4245452.80 frames. ], batch size: 607, lr: 4.25e-03, grad_scale: 16.0 2023-06-25 01:04:19,567 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1222422.0, ans=0.0 2023-06-25 01:04:24,957 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=1222422.0, ans=0.125 2023-06-25 01:04:26,589 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1222482.0, ans=0.125 2023-06-25 01:04:38,578 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=1222482.0, ans=0.2 2023-06-25 01:04:39,210 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.86 vs. limit=15.0 2023-06-25 01:04:54,906 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.68 vs. limit=15.0 2023-06-25 01:05:07,456 INFO [train.py:996] (3/4) Epoch 7, batch 20800, loss[loss=0.1993, simple_loss=0.2686, pruned_loss=0.06494, over 21833.00 frames. ], tot_loss[loss=0.2227, simple_loss=0.2984, pruned_loss=0.07353, over 4251206.23 frames. ], batch size: 318, lr: 4.25e-03, grad_scale: 32.0 2023-06-25 01:05:08,295 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1222602.0, ans=0.1 2023-06-25 01:05:09,778 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1222602.0, ans=0.125 2023-06-25 01:05:20,511 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1222602.0, ans=0.125 2023-06-25 01:05:26,156 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1222662.0, ans=0.125 2023-06-25 01:06:02,054 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=1222722.0, ans=0.125 2023-06-25 01:06:14,526 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1222782.0, ans=0.125 2023-06-25 01:06:43,828 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.363e+02 3.312e+02 4.339e+02 6.808e+02 1.439e+03, threshold=8.678e+02, percent-clipped=19.0 2023-06-25 01:06:45,216 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.70 vs. 
limit=22.5 2023-06-25 01:06:53,372 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.24 vs. limit=6.0 2023-06-25 01:06:55,800 INFO [train.py:996] (3/4) Epoch 7, batch 20850, loss[loss=0.1722, simple_loss=0.2457, pruned_loss=0.04933, over 21572.00 frames. ], tot_loss[loss=0.2187, simple_loss=0.2934, pruned_loss=0.07201, over 4252884.57 frames. ], batch size: 263, lr: 4.25e-03, grad_scale: 32.0 2023-06-25 01:07:34,861 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1223022.0, ans=0.0 2023-06-25 01:08:01,024 INFO [scaling.py:962] (3/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=5.15 vs. limit=8.0 2023-06-25 01:08:03,625 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=1223082.0, ans=0.125 2023-06-25 01:08:26,199 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1223142.0, ans=0.0 2023-06-25 01:08:44,702 INFO [train.py:996] (3/4) Epoch 7, batch 20900, loss[loss=0.2515, simple_loss=0.3757, pruned_loss=0.06364, over 19714.00 frames. ], tot_loss[loss=0.2181, simple_loss=0.2932, pruned_loss=0.0715, over 4251983.31 frames. ], batch size: 702, lr: 4.25e-03, grad_scale: 32.0 2023-06-25 01:09:17,654 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1223262.0, ans=0.125 2023-06-25 01:09:24,028 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=1223322.0, ans=0.125 2023-06-25 01:09:24,088 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1223322.0, ans=0.125 2023-06-25 01:09:52,995 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1223382.0, ans=0.1 2023-06-25 01:09:56,455 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=1223382.0, ans=0.2 2023-06-25 01:10:19,931 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.054e+02 2.894e+02 3.467e+02 4.402e+02 7.475e+02, threshold=6.935e+02, percent-clipped=1.0 2023-06-25 01:10:30,273 INFO [train.py:996] (3/4) Epoch 7, batch 20950, loss[loss=0.2356, simple_loss=0.302, pruned_loss=0.08457, over 21788.00 frames. ], tot_loss[loss=0.2146, simple_loss=0.2906, pruned_loss=0.06928, over 4253120.50 frames. ], batch size: 414, lr: 4.25e-03, grad_scale: 16.0 2023-06-25 01:10:53,331 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=1223562.0, ans=0.0 2023-06-25 01:11:40,753 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=1223682.0, ans=0.2 2023-06-25 01:12:09,740 INFO [train.py:996] (3/4) Epoch 7, batch 21000, loss[loss=0.1963, simple_loss=0.2739, pruned_loss=0.05939, over 21870.00 frames. ], tot_loss[loss=0.2136, simple_loss=0.2881, pruned_loss=0.06956, over 4262032.52 frames. 
], batch size: 98, lr: 4.25e-03, grad_scale: 16.0 2023-06-25 01:12:09,740 INFO [train.py:1019] (3/4) Computing validation loss 2023-06-25 01:12:24,880 INFO [zipformer.py:1728] (3/4) name=encoder.encoders.0.layers.0.self_attn_weights, attn_weights_entropy = tensor([4.6349, 4.7218, 4.4729, 4.2936], device='cuda:3') 2023-06-25 01:12:27,621 INFO [train.py:1028] (3/4) Epoch 7, validation: loss=0.2666, simple_loss=0.3633, pruned_loss=0.08493, over 1796401.00 frames. 2023-06-25 01:12:27,622 INFO [train.py:1029] (3/4) Maximum memory allocated so far is 23564MB 2023-06-25 01:12:33,919 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1223802.0, ans=0.2 2023-06-25 01:13:45,030 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=1223982.0, ans=0.07 2023-06-25 01:14:00,325 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=1224042.0, ans=0.2 2023-06-25 01:14:06,882 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.137e+02 2.703e+02 3.087e+02 3.976e+02 6.503e+02, threshold=6.174e+02, percent-clipped=0.0 2023-06-25 01:14:17,188 INFO [train.py:996] (3/4) Epoch 7, batch 21050, loss[loss=0.1963, simple_loss=0.2654, pruned_loss=0.06364, over 21613.00 frames. ], tot_loss[loss=0.2128, simple_loss=0.2857, pruned_loss=0.06996, over 4264110.64 frames. ], batch size: 282, lr: 4.25e-03, grad_scale: 16.0 2023-06-25 01:14:21,149 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1224102.0, ans=0.0 2023-06-25 01:14:24,757 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1224102.0, ans=0.1 2023-06-25 01:14:50,421 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.48 vs. limit=15.0 2023-06-25 01:15:46,004 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1224342.0, ans=0.0 2023-06-25 01:15:48,023 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=1224342.0, ans=0.125 2023-06-25 01:16:05,200 INFO [train.py:996] (3/4) Epoch 7, batch 21100, loss[loss=0.2261, simple_loss=0.275, pruned_loss=0.08857, over 21241.00 frames. ], tot_loss[loss=0.2096, simple_loss=0.2816, pruned_loss=0.06882, over 4258326.21 frames. ], batch size: 471, lr: 4.25e-03, grad_scale: 16.0 2023-06-25 01:16:20,769 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1224462.0, ans=0.0 2023-06-25 01:17:13,186 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1224582.0, ans=0.125 2023-06-25 01:17:42,064 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.130e+02 2.657e+02 3.143e+02 4.101e+02 9.163e+02, threshold=6.287e+02, percent-clipped=4.0 2023-06-25 01:17:52,618 INFO [train.py:996] (3/4) Epoch 7, batch 21150, loss[loss=0.2077, simple_loss=0.2847, pruned_loss=0.06529, over 15031.00 frames. ], tot_loss[loss=0.2081, simple_loss=0.2787, pruned_loss=0.06877, over 4251302.41 frames. 
], batch size: 60, lr: 4.25e-03, grad_scale: 16.0 2023-06-25 01:18:28,592 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=16.89 vs. limit=22.5 2023-06-25 01:18:50,420 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.24 vs. limit=6.0 2023-06-25 01:19:39,290 INFO [train.py:996] (3/4) Epoch 7, batch 21200, loss[loss=0.1986, simple_loss=0.2618, pruned_loss=0.06764, over 21892.00 frames. ], tot_loss[loss=0.2053, simple_loss=0.2744, pruned_loss=0.0681, over 4245180.39 frames. ], batch size: 107, lr: 4.25e-03, grad_scale: 32.0 2023-06-25 01:19:53,225 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=12.04 vs. limit=22.5 2023-06-25 01:19:58,463 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.71 vs. limit=15.0 2023-06-25 01:20:31,783 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.82 vs. limit=6.0 2023-06-25 01:20:52,077 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.20 vs. limit=15.0 2023-06-25 01:21:08,881 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1225242.0, ans=0.125 2023-06-25 01:21:17,807 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.018e+02 2.659e+02 3.125e+02 3.870e+02 6.186e+02, threshold=6.250e+02, percent-clipped=0.0 2023-06-25 01:21:18,428 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1225242.0, ans=0.2 2023-06-25 01:21:25,994 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.39 vs. limit=15.0 2023-06-25 01:21:28,366 INFO [train.py:996] (3/4) Epoch 7, batch 21250, loss[loss=0.2021, simple_loss=0.2866, pruned_loss=0.05877, over 21601.00 frames. ], tot_loss[loss=0.2057, simple_loss=0.2745, pruned_loss=0.06842, over 4245712.34 frames. ], batch size: 230, lr: 4.25e-03, grad_scale: 32.0 2023-06-25 01:23:05,599 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-25 01:23:15,039 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.89 vs. limit=15.0 2023-06-25 01:23:15,830 INFO [train.py:996] (3/4) Epoch 7, batch 21300, loss[loss=0.2157, simple_loss=0.2899, pruned_loss=0.07078, over 21912.00 frames. ], tot_loss[loss=0.212, simple_loss=0.282, pruned_loss=0.07098, over 4241504.59 frames. ], batch size: 316, lr: 4.25e-03, grad_scale: 32.0 2023-06-25 01:23:28,987 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.58 vs. 
limit=15.0 2023-06-25 01:23:30,393 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=1225602.0, ans=0.2 2023-06-25 01:24:15,493 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.22 vs. limit=12.0 2023-06-25 01:24:25,125 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=1225782.0, ans=0.035 2023-06-25 01:24:45,922 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1225842.0, ans=0.0 2023-06-25 01:24:45,931 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1225842.0, ans=0.0 2023-06-25 01:24:55,313 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.338e+02 2.894e+02 3.300e+02 4.575e+02 9.382e+02, threshold=6.600e+02, percent-clipped=9.0 2023-06-25 01:24:59,590 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1225842.0, ans=0.125 2023-06-25 01:25:04,014 INFO [train.py:996] (3/4) Epoch 7, batch 21350, loss[loss=0.1915, simple_loss=0.2898, pruned_loss=0.0466, over 21776.00 frames. ], tot_loss[loss=0.2144, simple_loss=0.2859, pruned_loss=0.07142, over 4250592.92 frames. ], batch size: 298, lr: 4.24e-03, grad_scale: 16.0 2023-06-25 01:26:51,931 INFO [train.py:996] (3/4) Epoch 7, batch 21400, loss[loss=0.2154, simple_loss=0.2758, pruned_loss=0.07752, over 20142.00 frames. ], tot_loss[loss=0.2142, simple_loss=0.2877, pruned_loss=0.07033, over 4255798.18 frames. ], batch size: 702, lr: 4.24e-03, grad_scale: 16.0 2023-06-25 01:26:58,000 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 01:27:47,383 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=8.54 vs. limit=15.0 2023-06-25 01:28:18,156 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=1226382.0, ans=0.2 2023-06-25 01:28:31,711 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.113e+02 3.088e+02 4.012e+02 5.119e+02 7.296e+02, threshold=8.024e+02, percent-clipped=4.0 2023-06-25 01:28:32,822 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=12.25 vs. limit=22.5 2023-06-25 01:28:40,324 INFO [train.py:996] (3/4) Epoch 7, batch 21450, loss[loss=0.2121, simple_loss=0.2904, pruned_loss=0.06691, over 20649.00 frames. ], tot_loss[loss=0.2177, simple_loss=0.2912, pruned_loss=0.07211, over 4260237.69 frames. ], batch size: 607, lr: 4.24e-03, grad_scale: 16.0 2023-06-25 01:29:43,886 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=1226622.0, ans=0.2 2023-06-25 01:30:00,834 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1226682.0, ans=0.2 2023-06-25 01:30:09,993 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.19 vs. 
limit=12.0 2023-06-25 01:30:11,393 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1226742.0, ans=0.125 2023-06-25 01:30:28,733 INFO [train.py:996] (3/4) Epoch 7, batch 21500, loss[loss=0.2091, simple_loss=0.2763, pruned_loss=0.07088, over 21399.00 frames. ], tot_loss[loss=0.218, simple_loss=0.2904, pruned_loss=0.07284, over 4265978.58 frames. ], batch size: 131, lr: 4.24e-03, grad_scale: 16.0 2023-06-25 01:30:30,743 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1226802.0, ans=0.125 2023-06-25 01:30:40,995 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1226802.0, ans=0.1 2023-06-25 01:31:28,908 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1226922.0, ans=0.125 2023-06-25 01:31:41,340 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1226982.0, ans=0.0 2023-06-25 01:32:06,005 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.254e+02 2.889e+02 3.383e+02 4.228e+02 8.142e+02, threshold=6.766e+02, percent-clipped=1.0 2023-06-25 01:32:14,667 INFO [train.py:996] (3/4) Epoch 7, batch 21550, loss[loss=0.1838, simple_loss=0.2595, pruned_loss=0.05403, over 21769.00 frames. ], tot_loss[loss=0.2122, simple_loss=0.2836, pruned_loss=0.07039, over 4269182.09 frames. ], batch size: 124, lr: 4.24e-03, grad_scale: 16.0 2023-06-25 01:33:10,012 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1227222.0, ans=0.125 2023-06-25 01:33:36,613 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=1227282.0, ans=0.2 2023-06-25 01:33:45,448 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1227342.0, ans=0.1 2023-06-25 01:33:59,005 INFO [train.py:996] (3/4) Epoch 7, batch 21600, loss[loss=0.199, simple_loss=0.2619, pruned_loss=0.06807, over 21805.00 frames. ], tot_loss[loss=0.2084, simple_loss=0.2783, pruned_loss=0.06928, over 4277596.59 frames. ], batch size: 352, lr: 4.24e-03, grad_scale: 32.0 2023-06-25 01:35:16,349 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=1227582.0, ans=0.0 2023-06-25 01:35:39,753 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.37 vs. limit=15.0 2023-06-25 01:35:40,096 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.017e+02 2.809e+02 3.415e+02 4.856e+02 1.279e+03, threshold=6.830e+02, percent-clipped=8.0 2023-06-25 01:35:53,434 INFO [train.py:996] (3/4) Epoch 7, batch 21650, loss[loss=0.209, simple_loss=0.2952, pruned_loss=0.06136, over 21231.00 frames. ], tot_loss[loss=0.2088, simple_loss=0.2818, pruned_loss=0.06792, over 4273795.89 frames. 
], batch size: 176, lr: 4.24e-03, grad_scale: 16.0 2023-06-25 01:35:58,862 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=1227702.0, ans=0.2 2023-06-25 01:37:14,854 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=1227882.0, ans=0.2 2023-06-25 01:37:33,604 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=1228002.0, ans=0.0 2023-06-25 01:37:34,935 INFO [train.py:996] (3/4) Epoch 7, batch 21700, loss[loss=0.2097, simple_loss=0.2756, pruned_loss=0.07187, over 21606.00 frames. ], tot_loss[loss=0.2065, simple_loss=0.2815, pruned_loss=0.06571, over 4276173.43 frames. ], batch size: 332, lr: 4.24e-03, grad_scale: 16.0 2023-06-25 01:37:38,748 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1228002.0, ans=0.1 2023-06-25 01:37:38,827 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=1228002.0, ans=0.2 2023-06-25 01:38:28,443 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=1228122.0, ans=0.035 2023-06-25 01:39:04,343 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=1228242.0, ans=0.125 2023-06-25 01:39:09,393 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=1228242.0, ans=0.2 2023-06-25 01:39:14,328 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.145e+02 3.013e+02 3.692e+02 5.814e+02 1.203e+03, threshold=7.384e+02, percent-clipped=13.0 2023-06-25 01:39:20,975 INFO [train.py:996] (3/4) Epoch 7, batch 21750, loss[loss=0.2421, simple_loss=0.2863, pruned_loss=0.09894, over 21240.00 frames. ], tot_loss[loss=0.2055, simple_loss=0.2791, pruned_loss=0.06599, over 4273753.67 frames. ], batch size: 471, lr: 4.24e-03, grad_scale: 16.0 2023-06-25 01:39:23,375 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=1228302.0, ans=0.0 2023-06-25 01:40:21,058 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=1228422.0, ans=0.2 2023-06-25 01:40:35,517 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1228482.0, ans=0.125 2023-06-25 01:41:07,520 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1228602.0, ans=0.0 2023-06-25 01:41:08,617 INFO [train.py:996] (3/4) Epoch 7, batch 21800, loss[loss=0.2348, simple_loss=0.3059, pruned_loss=0.08187, over 21644.00 frames. ], tot_loss[loss=0.2062, simple_loss=0.2791, pruned_loss=0.0667, over 4276888.67 frames. ], batch size: 298, lr: 4.24e-03, grad_scale: 16.0 2023-06-25 01:41:50,847 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=7.80 vs. limit=15.0 2023-06-25 01:42:02,210 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=1228722.0, ans=0.125 2023-06-25 01:42:26,387 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.23 vs. 
limit=15.0 2023-06-25 01:42:45,491 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=7.89 vs. limit=15.0 2023-06-25 01:42:45,917 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.478e+02 3.194e+02 4.069e+02 5.190e+02 9.750e+02, threshold=8.138e+02, percent-clipped=3.0 2023-06-25 01:42:53,026 INFO [train.py:996] (3/4) Epoch 7, batch 21850, loss[loss=0.2373, simple_loss=0.3107, pruned_loss=0.08198, over 21764.00 frames. ], tot_loss[loss=0.209, simple_loss=0.283, pruned_loss=0.06751, over 4274178.69 frames. ], batch size: 112, lr: 4.24e-03, grad_scale: 16.0 2023-06-25 01:43:25,541 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1228962.0, ans=0.1 2023-06-25 01:44:37,557 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.79 vs. limit=10.0 2023-06-25 01:44:44,462 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.05 vs. limit=15.0 2023-06-25 01:44:44,842 INFO [train.py:996] (3/4) Epoch 7, batch 21900, loss[loss=0.2005, simple_loss=0.2641, pruned_loss=0.06841, over 21481.00 frames. ], tot_loss[loss=0.2119, simple_loss=0.2853, pruned_loss=0.0693, over 4280156.30 frames. ], batch size: 194, lr: 4.24e-03, grad_scale: 16.0 2023-06-25 01:45:02,067 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=1229202.0, ans=0.125 2023-06-25 01:45:08,860 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1229262.0, ans=0.125 2023-06-25 01:45:09,407 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.92 vs. limit=15.0 2023-06-25 01:45:45,491 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1229322.0, ans=0.1 2023-06-25 01:46:19,671 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.343e+02 2.991e+02 3.581e+02 4.789e+02 1.002e+03, threshold=7.161e+02, percent-clipped=1.0 2023-06-25 01:46:31,070 INFO [train.py:996] (3/4) Epoch 7, batch 21950, loss[loss=0.216, simple_loss=0.3229, pruned_loss=0.05449, over 20906.00 frames. ], tot_loss[loss=0.2082, simple_loss=0.2808, pruned_loss=0.06785, over 4272115.01 frames. ], batch size: 607, lr: 4.24e-03, grad_scale: 16.0 2023-06-25 01:46:58,289 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.83 vs. limit=10.0 2023-06-25 01:47:07,768 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1229562.0, ans=0.1 2023-06-25 01:47:17,162 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=1229622.0, ans=0.0 2023-06-25 01:47:34,241 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=1229622.0, ans=0.2 2023-06-25 01:47:45,760 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.56 vs. 
limit=15.0 2023-06-25 01:47:46,552 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1229682.0, ans=0.0 2023-06-25 01:47:59,659 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1229742.0, ans=0.125 2023-06-25 01:48:20,575 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=1229742.0, ans=0.125 2023-06-25 01:48:26,799 INFO [train.py:996] (3/4) Epoch 7, batch 22000, loss[loss=0.2068, simple_loss=0.2737, pruned_loss=0.06994, over 21963.00 frames. ], tot_loss[loss=0.2025, simple_loss=0.2749, pruned_loss=0.06507, over 4269183.16 frames. ], batch size: 103, lr: 4.24e-03, grad_scale: 32.0 2023-06-25 01:49:14,699 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1229922.0, ans=0.0 2023-06-25 01:49:18,114 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1229922.0, ans=0.0 2023-06-25 01:49:30,923 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1229982.0, ans=0.0 2023-06-25 01:50:01,884 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=1230042.0, ans=0.2 2023-06-25 01:50:12,189 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.130e+02 3.193e+02 3.853e+02 5.102e+02 1.201e+03, threshold=7.707e+02, percent-clipped=7.0 2023-06-25 01:50:13,296 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.whiten.whitening_limit, batch_count=1230042.0, ans=12.0 2023-06-25 01:50:13,296 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=2.68 vs. limit=12.0 2023-06-25 01:50:17,727 INFO [train.py:996] (3/4) Epoch 7, batch 22050, loss[loss=0.262, simple_loss=0.3487, pruned_loss=0.08762, over 21848.00 frames. ], tot_loss[loss=0.2082, simple_loss=0.2815, pruned_loss=0.06746, over 4257917.34 frames. ], batch size: 372, lr: 4.24e-03, grad_scale: 16.0 2023-06-25 01:51:04,341 INFO [scaling.py:962] (3/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=3.90 vs. limit=5.0 2023-06-25 01:51:10,516 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=1230222.0, ans=0.2 2023-06-25 01:51:15,966 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1230222.0, ans=0.125 2023-06-25 01:51:19,508 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=1230222.0, ans=0.2 2023-06-25 01:51:19,526 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=1230222.0, ans=0.125 2023-06-25 01:51:33,336 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1230282.0, ans=0.0 2023-06-25 01:51:48,215 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 01:52:06,966 INFO [train.py:996] (3/4) Epoch 7, batch 22100, loss[loss=0.2409, simple_loss=0.3084, pruned_loss=0.0867, over 21525.00 frames. 
], tot_loss[loss=0.217, simple_loss=0.2908, pruned_loss=0.0716, over 4252576.01 frames. ], batch size: 548, lr: 4.24e-03, grad_scale: 16.0 2023-06-25 01:52:07,497 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1230402.0, ans=0.0 2023-06-25 01:52:49,128 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1230462.0, ans=0.125 2023-06-25 01:53:01,341 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1230522.0, ans=0.125 2023-06-25 01:53:49,131 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.658e+02 3.415e+02 4.118e+02 5.475e+02 8.069e+02, threshold=8.235e+02, percent-clipped=4.0 2023-06-25 01:53:54,206 INFO [train.py:996] (3/4) Epoch 7, batch 22150, loss[loss=0.2312, simple_loss=0.3042, pruned_loss=0.07915, over 21892.00 frames. ], tot_loss[loss=0.2193, simple_loss=0.293, pruned_loss=0.07276, over 4265480.91 frames. ], batch size: 107, lr: 4.24e-03, grad_scale: 16.0 2023-06-25 01:54:14,836 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.40 vs. limit=6.0 2023-06-25 01:54:44,466 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.86 vs. limit=12.0 2023-06-25 01:54:56,150 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.34 vs. limit=22.5 2023-06-25 01:55:41,169 INFO [train.py:996] (3/4) Epoch 7, batch 22200, loss[loss=0.2493, simple_loss=0.3238, pruned_loss=0.08742, over 21786.00 frames. ], tot_loss[loss=0.2229, simple_loss=0.2968, pruned_loss=0.07452, over 4269529.45 frames. ], batch size: 441, lr: 4.24e-03, grad_scale: 16.0 2023-06-25 01:56:06,792 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=1231062.0, ans=0.0 2023-06-25 01:56:20,029 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.33 vs. limit=15.0 2023-06-25 01:57:21,483 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=6.92 vs. limit=15.0 2023-06-25 01:57:25,286 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.355e+02 3.120e+02 3.891e+02 5.411e+02 1.488e+03, threshold=7.782e+02, percent-clipped=8.0 2023-06-25 01:57:31,130 INFO [train.py:996] (3/4) Epoch 7, batch 22250, loss[loss=0.2424, simple_loss=0.3156, pruned_loss=0.08462, over 21482.00 frames. ], tot_loss[loss=0.2278, simple_loss=0.3036, pruned_loss=0.07601, over 4272261.00 frames. ], batch size: 211, lr: 4.24e-03, grad_scale: 16.0 2023-06-25 01:58:20,391 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1231422.0, ans=0.1 2023-06-25 01:58:37,093 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1231482.0, ans=0.125 2023-06-25 01:58:41,008 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.25 vs. 
limit=6.0 2023-06-25 01:59:12,044 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.54 vs. limit=15.0 2023-06-25 01:59:18,341 INFO [train.py:996] (3/4) Epoch 7, batch 22300, loss[loss=0.2199, simple_loss=0.2829, pruned_loss=0.07843, over 21491.00 frames. ], tot_loss[loss=0.2294, simple_loss=0.3035, pruned_loss=0.07759, over 4275717.37 frames. ], batch size: 211, lr: 4.23e-03, grad_scale: 16.0 2023-06-25 01:59:20,294 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1231602.0, ans=0.0 2023-06-25 02:00:15,585 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=1231722.0, ans=0.125 2023-06-25 02:00:31,215 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1231782.0, ans=0.1 2023-06-25 02:00:46,169 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=1231842.0, ans=0.125 2023-06-25 02:01:00,002 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.515e+02 3.143e+02 3.997e+02 5.587e+02 8.969e+02, threshold=7.995e+02, percent-clipped=6.0 2023-06-25 02:01:10,872 INFO [train.py:996] (3/4) Epoch 7, batch 22350, loss[loss=0.2237, simple_loss=0.2836, pruned_loss=0.08188, over 21554.00 frames. ], tot_loss[loss=0.2286, simple_loss=0.3011, pruned_loss=0.07799, over 4288024.10 frames. ], batch size: 548, lr: 4.23e-03, grad_scale: 16.0 2023-06-25 02:01:35,817 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1231962.0, ans=0.1 2023-06-25 02:01:50,765 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.89 vs. limit=10.0 2023-06-25 02:01:55,569 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-25 02:02:31,931 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=1232082.0, ans=0.0 2023-06-25 02:02:54,184 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=1232142.0, ans=0.125 2023-06-25 02:02:57,247 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=1232142.0, ans=0.125 2023-06-25 02:02:59,812 INFO [train.py:996] (3/4) Epoch 7, batch 22400, loss[loss=0.2073, simple_loss=0.2783, pruned_loss=0.06813, over 21698.00 frames. ], tot_loss[loss=0.2231, simple_loss=0.2971, pruned_loss=0.07454, over 4285681.91 frames. ], batch size: 112, lr: 4.23e-03, grad_scale: 32.0 2023-06-25 02:03:15,013 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=12.30 vs. 
limit=15.0 2023-06-25 02:04:04,953 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1232322.0, ans=0.125 2023-06-25 02:04:32,649 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=1232442.0, ans=0.2 2023-06-25 02:04:42,725 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.236e+02 2.737e+02 3.177e+02 4.252e+02 6.969e+02, threshold=6.354e+02, percent-clipped=0.0 2023-06-25 02:04:48,410 INFO [train.py:996] (3/4) Epoch 7, batch 22450, loss[loss=0.2151, simple_loss=0.2782, pruned_loss=0.07598, over 21797.00 frames. ], tot_loss[loss=0.2203, simple_loss=0.2919, pruned_loss=0.07436, over 4275482.63 frames. ], batch size: 98, lr: 4.23e-03, grad_scale: 32.0 2023-06-25 02:06:04,317 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=10.96 vs. limit=22.5 2023-06-25 02:06:33,547 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=1232742.0, ans=0.125 2023-06-25 02:06:43,929 INFO [train.py:996] (3/4) Epoch 7, batch 22500, loss[loss=0.1867, simple_loss=0.2548, pruned_loss=0.05925, over 21932.00 frames. ], tot_loss[loss=0.2171, simple_loss=0.2871, pruned_loss=0.07351, over 4274009.18 frames. ], batch size: 113, lr: 4.23e-03, grad_scale: 32.0 2023-06-25 02:08:06,608 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=1232982.0, ans=0.0 2023-06-25 02:08:08,453 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1233042.0, ans=0.125 2023-06-25 02:08:15,899 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=1233042.0, ans=0.2 2023-06-25 02:08:17,892 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=1233042.0, ans=0.0 2023-06-25 02:08:22,746 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.325e+02 2.995e+02 3.831e+02 4.510e+02 7.998e+02, threshold=7.663e+02, percent-clipped=9.0 2023-06-25 02:08:32,940 INFO [train.py:996] (3/4) Epoch 7, batch 22550, loss[loss=0.2735, simple_loss=0.328, pruned_loss=0.1095, over 21715.00 frames. ], tot_loss[loss=0.2188, simple_loss=0.2906, pruned_loss=0.07348, over 4284024.48 frames. ], batch size: 507, lr: 4.23e-03, grad_scale: 32.0 2023-06-25 02:08:35,337 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1233102.0, ans=0.125 2023-06-25 02:08:42,714 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=1233102.0, ans=0.0 2023-06-25 02:09:50,656 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.47 vs. 
limit=15.0 2023-06-25 02:10:15,911 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=1233342.0, ans=0.2 2023-06-25 02:10:20,984 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1233342.0, ans=0.125 2023-06-25 02:10:25,379 INFO [train.py:996] (3/4) Epoch 7, batch 22600, loss[loss=0.1911, simple_loss=0.2588, pruned_loss=0.06168, over 21425.00 frames. ], tot_loss[loss=0.2216, simple_loss=0.2945, pruned_loss=0.07435, over 4290443.89 frames. ], batch size: 211, lr: 4.23e-03, grad_scale: 16.0 2023-06-25 02:10:31,760 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1233402.0, ans=0.125 2023-06-25 02:10:55,452 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1233462.0, ans=0.1 2023-06-25 02:11:29,621 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=7.12 vs. limit=15.0 2023-06-25 02:11:34,994 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=8.33 vs. limit=15.0 2023-06-25 02:11:53,919 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1233582.0, ans=0.1 2023-06-25 02:12:10,425 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.462e+02 3.222e+02 3.850e+02 5.288e+02 1.031e+03, threshold=7.700e+02, percent-clipped=4.0 2023-06-25 02:12:14,429 INFO [train.py:996] (3/4) Epoch 7, batch 22650, loss[loss=0.1869, simple_loss=0.2549, pruned_loss=0.05947, over 21761.00 frames. ], tot_loss[loss=0.2192, simple_loss=0.2912, pruned_loss=0.07359, over 4281609.90 frames. ], batch size: 124, lr: 4.23e-03, grad_scale: 16.0 2023-06-25 02:12:30,906 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.47 vs. limit=6.0 2023-06-25 02:12:40,507 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=1233762.0, ans=0.125 2023-06-25 02:13:10,776 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1233822.0, ans=0.125 2023-06-25 02:14:01,934 INFO [train.py:996] (3/4) Epoch 7, batch 22700, loss[loss=0.1961, simple_loss=0.2683, pruned_loss=0.062, over 21803.00 frames. ], tot_loss[loss=0.216, simple_loss=0.2866, pruned_loss=0.07271, over 4271874.20 frames. 
], batch size: 102, lr: 4.23e-03, grad_scale: 16.0 2023-06-25 02:14:27,727 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1234062.0, ans=0.1 2023-06-25 02:14:29,481 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer_ff3.min_abs, batch_count=1234062.0, ans=0.2 2023-06-25 02:14:36,610 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1234062.0, ans=0.125 2023-06-25 02:15:34,781 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1234242.0, ans=0.0 2023-06-25 02:15:37,213 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=10.73 vs. limit=15.0 2023-06-25 02:15:45,577 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1234242.0, ans=0.0 2023-06-25 02:15:46,860 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.360e+02 3.300e+02 4.052e+02 5.642e+02 1.079e+03, threshold=8.104e+02, percent-clipped=7.0 2023-06-25 02:15:49,893 INFO [train.py:996] (3/4) Epoch 7, batch 22750, loss[loss=0.195, simple_loss=0.2629, pruned_loss=0.06355, over 20794.00 frames. ], tot_loss[loss=0.2198, simple_loss=0.2892, pruned_loss=0.07525, over 4261958.50 frames. ], batch size: 607, lr: 4.23e-03, grad_scale: 16.0 2023-06-25 02:16:17,600 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=1234362.0, ans=0.07 2023-06-25 02:16:42,259 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.53 vs. limit=15.0 2023-06-25 02:17:15,425 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1234482.0, ans=0.125 2023-06-25 02:17:19,336 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.05 vs. limit=15.0 2023-06-25 02:17:19,551 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=8.26 vs. limit=10.0 2023-06-25 02:17:36,744 INFO [train.py:996] (3/4) Epoch 7, batch 22800, loss[loss=0.2595, simple_loss=0.3258, pruned_loss=0.09666, over 21236.00 frames. ], tot_loss[loss=0.223, simple_loss=0.2918, pruned_loss=0.07711, over 4273206.85 frames. ], batch size: 143, lr: 4.23e-03, grad_scale: 32.0 2023-06-25 02:18:04,862 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1234662.0, ans=0.0 2023-06-25 02:18:16,805 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=1234662.0, ans=0.09899494936611666 2023-06-25 02:18:22,413 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=1234722.0, ans=0.2 2023-06-25 02:18:24,542 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.58 vs. 
limit=15.0 2023-06-25 02:19:03,781 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.33 vs. limit=22.5 2023-06-25 02:19:23,105 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.424e+02 3.142e+02 3.789e+02 4.718e+02 7.259e+02, threshold=7.578e+02, percent-clipped=0.0 2023-06-25 02:19:25,152 INFO [train.py:996] (3/4) Epoch 7, batch 22850, loss[loss=0.213, simple_loss=0.2787, pruned_loss=0.0736, over 21387.00 frames. ], tot_loss[loss=0.22, simple_loss=0.2877, pruned_loss=0.07609, over 4278271.74 frames. ], batch size: 211, lr: 4.23e-03, grad_scale: 16.0 2023-06-25 02:20:05,386 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1235022.0, ans=0.125 2023-06-25 02:20:23,914 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.82 vs. limit=15.0 2023-06-25 02:20:50,287 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1235142.0, ans=0.125 2023-06-25 02:21:04,562 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-25 02:21:09,516 INFO [train.py:996] (3/4) Epoch 7, batch 22900, loss[loss=0.1763, simple_loss=0.2459, pruned_loss=0.05333, over 21853.00 frames. ], tot_loss[loss=0.2196, simple_loss=0.2885, pruned_loss=0.0754, over 4272770.54 frames. ], batch size: 107, lr: 4.23e-03, grad_scale: 16.0 2023-06-25 02:21:42,311 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=1235262.0, ans=0.2 2023-06-25 02:21:42,313 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=1235262.0, ans=0.0 2023-06-25 02:22:07,818 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=1235322.0, ans=0.0 2023-06-25 02:22:07,873 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=1235322.0, ans=0.0 2023-06-25 02:22:36,697 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1235442.0, ans=0.0 2023-06-25 02:22:47,757 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.29 vs. limit=15.0 2023-06-25 02:23:01,012 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1235442.0, ans=0.1 2023-06-25 02:23:04,025 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.336e+02 3.455e+02 4.744e+02 6.371e+02 1.430e+03, threshold=9.487e+02, percent-clipped=13.0 2023-06-25 02:23:05,580 INFO [train.py:996] (3/4) Epoch 7, batch 22950, loss[loss=0.2218, simple_loss=0.3399, pruned_loss=0.05188, over 21752.00 frames. ], tot_loss[loss=0.225, simple_loss=0.3016, pruned_loss=0.07419, over 4274071.89 frames. ], batch size: 298, lr: 4.23e-03, grad_scale: 16.0 2023-06-25 02:23:56,970 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.11 vs. 
limit=10.0 2023-06-25 02:24:19,719 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1235682.0, ans=0.125 2023-06-25 02:24:53,114 INFO [train.py:996] (3/4) Epoch 7, batch 23000, loss[loss=0.2043, simple_loss=0.2739, pruned_loss=0.06739, over 21656.00 frames. ], tot_loss[loss=0.2213, simple_loss=0.2997, pruned_loss=0.07151, over 4280284.82 frames. ], batch size: 230, lr: 4.23e-03, grad_scale: 16.0 2023-06-25 02:26:04,114 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=1235982.0, ans=0.2 2023-06-25 02:26:40,403 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.181e+02 3.049e+02 3.858e+02 4.759e+02 9.781e+02, threshold=7.716e+02, percent-clipped=2.0 2023-06-25 02:26:42,795 INFO [train.py:996] (3/4) Epoch 7, batch 23050, loss[loss=0.2831, simple_loss=0.3445, pruned_loss=0.1109, over 21807.00 frames. ], tot_loss[loss=0.2244, simple_loss=0.3016, pruned_loss=0.07365, over 4282886.09 frames. ], batch size: 441, lr: 4.23e-03, grad_scale: 16.0 2023-06-25 02:28:25,317 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1236342.0, ans=0.1 2023-06-25 02:28:31,641 INFO [train.py:996] (3/4) Epoch 7, batch 23100, loss[loss=0.1974, simple_loss=0.2579, pruned_loss=0.06846, over 21268.00 frames. ], tot_loss[loss=0.2225, simple_loss=0.2984, pruned_loss=0.07327, over 4284357.55 frames. ], batch size: 159, lr: 4.23e-03, grad_scale: 16.0 2023-06-25 02:28:39,081 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1236402.0, ans=0.1 2023-06-25 02:29:11,273 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1236462.0, ans=0.1 2023-06-25 02:30:16,720 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.225e+02 3.056e+02 3.591e+02 4.604e+02 9.748e+02, threshold=7.182e+02, percent-clipped=1.0 2023-06-25 02:30:18,309 INFO [train.py:996] (3/4) Epoch 7, batch 23150, loss[loss=0.2499, simple_loss=0.3127, pruned_loss=0.09354, over 21828.00 frames. ], tot_loss[loss=0.2211, simple_loss=0.2955, pruned_loss=0.07339, over 4281950.60 frames. ], batch size: 414, lr: 4.23e-03, grad_scale: 16.0 2023-06-25 02:30:33,559 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=1236762.0, ans=0.2 2023-06-25 02:31:01,460 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1236822.0, ans=0.125 2023-06-25 02:31:35,931 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1236942.0, ans=0.125 2023-06-25 02:31:37,671 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=1236942.0, ans=0.0 2023-06-25 02:31:44,206 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1236942.0, ans=0.1 2023-06-25 02:31:49,381 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1236942.0, ans=0.0 2023-06-25 02:32:03,886 INFO [train.py:996] (3/4) Epoch 7, batch 23200, loss[loss=0.2169, simple_loss=0.2872, pruned_loss=0.07334, over 21896.00 frames. 
], tot_loss[loss=0.2214, simple_loss=0.2945, pruned_loss=0.07414, over 4291430.26 frames. ], batch size: 371, lr: 4.23e-03, grad_scale: 32.0 2023-06-25 02:32:04,636 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1237002.0, ans=0.125 2023-06-25 02:32:09,935 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1237002.0, ans=0.125 2023-06-25 02:32:54,300 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=1237122.0, ans=10.0 2023-06-25 02:33:06,027 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=1237182.0, ans=0.125 2023-06-25 02:33:45,479 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=1237242.0, ans=0.125 2023-06-25 02:33:49,284 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1237242.0, ans=0.0 2023-06-25 02:33:52,454 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.409e+02 3.126e+02 3.728e+02 5.060e+02 1.069e+03, threshold=7.456e+02, percent-clipped=4.0 2023-06-25 02:33:52,485 INFO [train.py:996] (3/4) Epoch 7, batch 23250, loss[loss=0.2471, simple_loss=0.3173, pruned_loss=0.08844, over 21702.00 frames. ], tot_loss[loss=0.2223, simple_loss=0.2939, pruned_loss=0.07538, over 4294913.00 frames. ], batch size: 389, lr: 4.22e-03, grad_scale: 16.0 2023-06-25 02:35:26,678 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.58 vs. limit=22.5 2023-06-25 02:35:31,575 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1237542.0, ans=0.125 2023-06-25 02:35:43,504 INFO [train.py:996] (3/4) Epoch 7, batch 23300, loss[loss=0.196, simple_loss=0.258, pruned_loss=0.06705, over 21155.00 frames. ], tot_loss[loss=0.2254, simple_loss=0.2988, pruned_loss=0.07597, over 4290146.54 frames. ], batch size: 608, lr: 4.22e-03, grad_scale: 16.0 2023-06-25 02:35:45,711 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1237602.0, ans=0.1 2023-06-25 02:36:02,060 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1237602.0, ans=0.0 2023-06-25 02:36:58,904 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 02:37:23,353 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-25 02:37:39,006 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.304e+02 3.209e+02 3.833e+02 5.523e+02 1.342e+03, threshold=7.666e+02, percent-clipped=15.0 2023-06-25 02:37:39,037 INFO [train.py:996] (3/4) Epoch 7, batch 23350, loss[loss=0.1992, simple_loss=0.2876, pruned_loss=0.05536, over 21604.00 frames. ], tot_loss[loss=0.226, simple_loss=0.3021, pruned_loss=0.07496, over 4281380.05 frames. ], batch size: 441, lr: 4.22e-03, grad_scale: 16.0 2023-06-25 02:37:57,798 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.15 vs. 
limit=15.0 2023-06-25 02:38:03,992 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1237962.0, ans=0.125 2023-06-25 02:38:40,470 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=1238022.0, ans=0.04949747468305833 2023-06-25 02:39:04,825 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1238142.0, ans=0.1 2023-06-25 02:39:33,720 INFO [train.py:996] (3/4) Epoch 7, batch 23400, loss[loss=0.2029, simple_loss=0.2752, pruned_loss=0.06531, over 21933.00 frames. ], tot_loss[loss=0.2193, simple_loss=0.2958, pruned_loss=0.07137, over 4285406.18 frames. ], batch size: 316, lr: 4.22e-03, grad_scale: 16.0 2023-06-25 02:39:56,632 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1238262.0, ans=0.1 2023-06-25 02:40:07,040 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1238262.0, ans=0.0 2023-06-25 02:41:13,119 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1238442.0, ans=0.125 2023-06-25 02:41:18,114 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1238442.0, ans=0.1 2023-06-25 02:41:23,136 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.927e+02 3.153e+02 4.336e+02 5.410e+02 1.099e+03, threshold=8.672e+02, percent-clipped=12.0 2023-06-25 02:41:23,167 INFO [train.py:996] (3/4) Epoch 7, batch 23450, loss[loss=0.2538, simple_loss=0.3245, pruned_loss=0.09152, over 21291.00 frames. ], tot_loss[loss=0.222, simple_loss=0.2973, pruned_loss=0.0734, over 4279230.13 frames. ], batch size: 176, lr: 4.22e-03, grad_scale: 16.0 2023-06-25 02:41:37,915 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1238502.0, ans=0.125 2023-06-25 02:41:37,947 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=1238502.0, ans=0.0 2023-06-25 02:41:44,766 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1238562.0, ans=0.125 2023-06-25 02:42:18,663 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=1238622.0, ans=0.0 2023-06-25 02:42:37,638 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=1238682.0, ans=0.015 2023-06-25 02:43:06,178 INFO [train.py:996] (3/4) Epoch 7, batch 23500, loss[loss=0.2472, simple_loss=0.3099, pruned_loss=0.09222, over 21858.00 frames. ], tot_loss[loss=0.2242, simple_loss=0.2992, pruned_loss=0.07458, over 4279700.80 frames. 
], batch size: 441, lr: 4.22e-03, grad_scale: 16.0 2023-06-25 02:43:49,713 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1238922.0, ans=0.125 2023-06-25 02:44:53,871 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.391e+02 2.970e+02 3.465e+02 4.227e+02 7.885e+02, threshold=6.930e+02, percent-clipped=0.0 2023-06-25 02:44:53,900 INFO [train.py:996] (3/4) Epoch 7, batch 23550, loss[loss=0.1863, simple_loss=0.2472, pruned_loss=0.06269, over 21518.00 frames. ], tot_loss[loss=0.2211, simple_loss=0.2936, pruned_loss=0.07431, over 4285891.66 frames. ], batch size: 231, lr: 4.22e-03, grad_scale: 16.0 2023-06-25 02:44:56,115 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1239102.0, ans=0.125 2023-06-25 02:44:56,792 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=5.97 vs. limit=15.0 2023-06-25 02:44:59,539 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1239102.0, ans=0.125 2023-06-25 02:45:01,604 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1239102.0, ans=0.125 2023-06-25 02:46:32,839 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1239342.0, ans=0.125 2023-06-25 02:46:34,512 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1239342.0, ans=0.125 2023-06-25 02:46:38,082 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1239342.0, ans=0.1 2023-06-25 02:46:42,569 INFO [train.py:996] (3/4) Epoch 7, batch 23600, loss[loss=0.2355, simple_loss=0.3141, pruned_loss=0.07843, over 21439.00 frames. ], tot_loss[loss=0.222, simple_loss=0.2936, pruned_loss=0.07522, over 4280332.62 frames. ], batch size: 131, lr: 4.22e-03, grad_scale: 32.0 2023-06-25 02:47:25,156 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=1239522.0, ans=0.0 2023-06-25 02:47:28,563 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=1239522.0, ans=0.0 2023-06-25 02:47:35,769 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=1239522.0, ans=0.5 2023-06-25 02:47:50,200 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1239522.0, ans=0.125 2023-06-25 02:48:01,578 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=11.83 vs. 
limit=15.0 2023-06-25 02:48:06,040 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=1239582.0, ans=0.125 2023-06-25 02:48:18,189 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1239642.0, ans=0.0 2023-06-25 02:48:21,422 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1239642.0, ans=0.0 2023-06-25 02:48:28,069 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.331e+02 3.161e+02 4.117e+02 5.105e+02 1.053e+03, threshold=8.234e+02, percent-clipped=8.0 2023-06-25 02:48:28,106 INFO [train.py:996] (3/4) Epoch 7, batch 23650, loss[loss=0.2442, simple_loss=0.3199, pruned_loss=0.08427, over 21301.00 frames. ], tot_loss[loss=0.2204, simple_loss=0.2941, pruned_loss=0.07337, over 4279572.51 frames. ], batch size: 159, lr: 4.22e-03, grad_scale: 32.0 2023-06-25 02:49:19,392 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1239822.0, ans=0.1 2023-06-25 02:49:53,154 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=1239882.0, ans=0.2 2023-06-25 02:50:17,005 INFO [train.py:996] (3/4) Epoch 7, batch 23700, loss[loss=0.1923, simple_loss=0.2881, pruned_loss=0.0482, over 21626.00 frames. ], tot_loss[loss=0.2216, simple_loss=0.2971, pruned_loss=0.07308, over 4283564.09 frames. ], batch size: 247, lr: 4.22e-03, grad_scale: 16.0 2023-06-25 02:50:42,510 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1240062.0, ans=0.1 2023-06-25 02:51:03,003 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1240062.0, ans=0.125 2023-06-25 02:51:24,545 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-25 02:51:26,204 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1240182.0, ans=0.125 2023-06-25 02:51:50,168 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=1240242.0, ans=0.04949747468305833 2023-06-25 02:52:12,605 INFO [train.py:996] (3/4) Epoch 7, batch 23750, loss[loss=0.1869, simple_loss=0.2883, pruned_loss=0.04278, over 21800.00 frames. ], tot_loss[loss=0.2232, simple_loss=0.2991, pruned_loss=0.07361, over 4285263.43 frames. ], batch size: 282, lr: 4.22e-03, grad_scale: 16.0 2023-06-25 02:52:14,425 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.255e+02 3.374e+02 3.894e+02 5.027e+02 8.477e+02, threshold=7.788e+02, percent-clipped=1.0 2023-06-25 02:52:39,091 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.02 vs. 
limit=15.0 2023-06-25 02:52:58,012 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=1240422.0, ans=0.05 2023-06-25 02:53:08,702 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=1240422.0, ans=0.125 2023-06-25 02:53:26,502 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1240482.0, ans=0.125 2023-06-25 02:53:30,297 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=1240482.0, ans=0.05 2023-06-25 02:53:34,106 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=1240542.0, ans=0.2 2023-06-25 02:54:03,168 INFO [train.py:996] (3/4) Epoch 7, batch 23800, loss[loss=0.2251, simple_loss=0.2973, pruned_loss=0.07649, over 21359.00 frames. ], tot_loss[loss=0.2224, simple_loss=0.2996, pruned_loss=0.07265, over 4273341.18 frames. ], batch size: 131, lr: 4.22e-03, grad_scale: 16.0 2023-06-25 02:54:29,459 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=1240602.0, ans=0.0 2023-06-25 02:54:34,985 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1240662.0, ans=0.125 2023-06-25 02:54:44,293 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=1240662.0, ans=0.04949747468305833 2023-06-25 02:54:44,313 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 02:55:21,365 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1240782.0, ans=0.1 2023-06-25 02:55:29,119 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.55 vs. limit=15.0 2023-06-25 02:56:06,033 INFO [train.py:996] (3/4) Epoch 7, batch 23850, loss[loss=0.288, simple_loss=0.3527, pruned_loss=0.1116, over 21427.00 frames. ], tot_loss[loss=0.2285, simple_loss=0.3077, pruned_loss=0.07468, over 4272519.63 frames. ], batch size: 471, lr: 4.22e-03, grad_scale: 16.0 2023-06-25 02:56:07,964 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.236e+02 3.127e+02 4.092e+02 4.859e+02 9.689e+02, threshold=8.184e+02, percent-clipped=5.0 2023-06-25 02:56:12,297 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1240902.0, ans=0.1 2023-06-25 02:56:39,586 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=1240962.0, ans=0.0 2023-06-25 02:57:24,285 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1241082.0, ans=0.125 2023-06-25 02:57:50,566 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=1241142.0, ans=0.025 2023-06-25 02:57:55,525 INFO [train.py:996] (3/4) Epoch 7, batch 23900, loss[loss=0.243, simple_loss=0.3186, pruned_loss=0.08374, over 21815.00 frames. ], tot_loss[loss=0.2343, simple_loss=0.3143, pruned_loss=0.07709, over 4278483.68 frames. 
], batch size: 98, lr: 4.22e-03, grad_scale: 16.0 2023-06-25 02:58:13,159 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1241262.0, ans=0.1 2023-06-25 02:58:21,517 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.max_abs, batch_count=1241262.0, ans=10.0 2023-06-25 02:58:24,735 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1241262.0, ans=0.1 2023-06-25 02:58:33,945 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=1241322.0, ans=0.0 2023-06-25 02:58:37,268 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1241322.0, ans=0.2 2023-06-25 02:59:38,296 INFO [train.py:996] (3/4) Epoch 7, batch 23950, loss[loss=0.2261, simple_loss=0.2991, pruned_loss=0.07659, over 16097.00 frames. ], tot_loss[loss=0.2301, simple_loss=0.3074, pruned_loss=0.07637, over 4271782.76 frames. ], batch size: 62, lr: 4.22e-03, grad_scale: 16.0 2023-06-25 02:59:39,937 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.625e+02 3.372e+02 4.562e+02 5.557e+02 1.074e+03, threshold=9.124e+02, percent-clipped=7.0 2023-06-25 03:00:15,968 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=1241622.0, ans=0.0 2023-06-25 03:00:54,357 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=1241682.0, ans=0.125 2023-06-25 03:01:12,609 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.71 vs. limit=15.0 2023-06-25 03:01:27,365 INFO [train.py:996] (3/4) Epoch 7, batch 24000, loss[loss=0.2448, simple_loss=0.3188, pruned_loss=0.08536, over 21455.00 frames. ], tot_loss[loss=0.2333, simple_loss=0.3086, pruned_loss=0.07901, over 4276958.80 frames. ], batch size: 211, lr: 4.22e-03, grad_scale: 32.0 2023-06-25 03:01:27,365 INFO [train.py:1019] (3/4) Computing validation loss 2023-06-25 03:01:45,553 INFO [train.py:1028] (3/4) Epoch 7, validation: loss=0.2668, simple_loss=0.3629, pruned_loss=0.0854, over 1796401.00 frames. 2023-06-25 03:01:45,554 INFO [train.py:1029] (3/4) Maximum memory allocated so far is 23654MB 2023-06-25 03:02:03,015 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1241862.0, ans=0.0 2023-06-25 03:02:10,742 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=6.98 vs. limit=12.0 2023-06-25 03:02:33,333 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1241922.0, ans=0.125 2023-06-25 03:02:57,025 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1241982.0, ans=0.125 2023-06-25 03:03:27,248 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1242042.0, ans=0.0 2023-06-25 03:03:35,952 INFO [train.py:996] (3/4) Epoch 7, batch 24050, loss[loss=0.2256, simple_loss=0.3136, pruned_loss=0.06876, over 21759.00 frames. ], tot_loss[loss=0.2341, simple_loss=0.3103, pruned_loss=0.079, over 4268308.14 frames. 
], batch size: 351, lr: 4.22e-03, grad_scale: 16.0 2023-06-25 03:03:39,405 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.322e+02 3.516e+02 4.440e+02 5.748e+02 1.093e+03, threshold=8.881e+02, percent-clipped=2.0 2023-06-25 03:03:41,483 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1242102.0, ans=0.0 2023-06-25 03:04:56,511 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=1242282.0, ans=0.2 2023-06-25 03:05:20,279 INFO [train.py:996] (3/4) Epoch 7, batch 24100, loss[loss=0.2339, simple_loss=0.315, pruned_loss=0.07634, over 21724.00 frames. ], tot_loss[loss=0.2324, simple_loss=0.3097, pruned_loss=0.07752, over 4268005.05 frames. ], batch size: 298, lr: 4.22e-03, grad_scale: 16.0 2023-06-25 03:05:57,263 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1242462.0, ans=0.125 2023-06-25 03:06:28,548 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1242522.0, ans=0.0 2023-06-25 03:06:59,466 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=1242642.0, ans=0.125 2023-06-25 03:07:09,401 INFO [train.py:996] (3/4) Epoch 7, batch 24150, loss[loss=0.2368, simple_loss=0.3081, pruned_loss=0.08272, over 21274.00 frames. ], tot_loss[loss=0.2328, simple_loss=0.3088, pruned_loss=0.0784, over 4272666.31 frames. ], batch size: 143, lr: 4.22e-03, grad_scale: 16.0 2023-06-25 03:07:12,835 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.342e+02 3.235e+02 4.030e+02 4.867e+02 1.048e+03, threshold=8.060e+02, percent-clipped=3.0 2023-06-25 03:07:13,808 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1242702.0, ans=0.0 2023-06-25 03:08:05,218 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=1242822.0, ans=0.2 2023-06-25 03:08:38,284 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.whiten.whitening_limit, batch_count=1242942.0, ans=15.0 2023-06-25 03:08:53,061 INFO [train.py:996] (3/4) Epoch 7, batch 24200, loss[loss=0.2183, simple_loss=0.3049, pruned_loss=0.06588, over 21680.00 frames. ], tot_loss[loss=0.2354, simple_loss=0.3112, pruned_loss=0.07981, over 4275676.18 frames. ], batch size: 263, lr: 4.22e-03, grad_scale: 16.0 2023-06-25 03:09:00,590 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=1243002.0, ans=0.125 2023-06-25 03:10:09,817 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=1243182.0, ans=0.2 2023-06-25 03:10:48,484 INFO [train.py:996] (3/4) Epoch 7, batch 24250, loss[loss=0.1803, simple_loss=0.2723, pruned_loss=0.04419, over 21341.00 frames. ], tot_loss[loss=0.2281, simple_loss=0.3073, pruned_loss=0.0745, over 4269288.09 frames. 
], batch size: 176, lr: 4.21e-03, grad_scale: 16.0 2023-06-25 03:10:49,005 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=1243302.0, ans=0.125 2023-06-25 03:10:51,920 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.960e+02 3.061e+02 3.870e+02 4.839e+02 8.744e+02, threshold=7.741e+02, percent-clipped=3.0 2023-06-25 03:11:43,213 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-25 03:11:56,756 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-25 03:12:38,083 INFO [train.py:996] (3/4) Epoch 7, batch 24300, loss[loss=0.2364, simple_loss=0.3456, pruned_loss=0.06365, over 20769.00 frames. ], tot_loss[loss=0.2218, simple_loss=0.3035, pruned_loss=0.07007, over 4269010.39 frames. ], batch size: 607, lr: 4.21e-03, grad_scale: 16.0 2023-06-25 03:12:42,115 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=1243602.0, ans=10.0 2023-06-25 03:12:42,173 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=1243602.0, ans=0.2 2023-06-25 03:12:59,681 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1243602.0, ans=0.125 2023-06-25 03:13:10,164 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1243662.0, ans=0.125 2023-06-25 03:13:58,723 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=1243782.0, ans=0.035 2023-06-25 03:14:08,385 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.83 vs. limit=15.0 2023-06-25 03:14:26,049 INFO [train.py:996] (3/4) Epoch 7, batch 24350, loss[loss=0.263, simple_loss=0.3302, pruned_loss=0.09794, over 21813.00 frames. ], tot_loss[loss=0.2213, simple_loss=0.3008, pruned_loss=0.07091, over 4273489.22 frames. ], batch size: 441, lr: 4.21e-03, grad_scale: 16.0 2023-06-25 03:14:34,780 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.057e+02 2.804e+02 3.474e+02 4.596e+02 8.821e+02, threshold=6.948e+02, percent-clipped=1.0 2023-06-25 03:14:45,444 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.17 vs. limit=15.0 2023-06-25 03:15:05,747 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=1243962.0, ans=0.0 2023-06-25 03:16:20,447 INFO [train.py:996] (3/4) Epoch 7, batch 24400, loss[loss=0.2268, simple_loss=0.3093, pruned_loss=0.07221, over 21292.00 frames. ], tot_loss[loss=0.2258, simple_loss=0.3043, pruned_loss=0.07361, over 4281602.51 frames. 
], batch size: 548, lr: 4.21e-03, grad_scale: 32.0 2023-06-25 03:16:28,015 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=1244202.0, ans=0.07 2023-06-25 03:16:47,862 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1244262.0, ans=0.125 2023-06-25 03:17:02,118 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1244322.0, ans=0.0 2023-06-25 03:17:14,055 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1244322.0, ans=0.1 2023-06-25 03:17:24,842 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1244382.0, ans=0.125 2023-06-25 03:17:29,882 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1244382.0, ans=0.125 2023-06-25 03:18:15,744 INFO [train.py:996] (3/4) Epoch 7, batch 24450, loss[loss=0.3425, simple_loss=0.4132, pruned_loss=0.1359, over 21431.00 frames. ], tot_loss[loss=0.2279, simple_loss=0.3059, pruned_loss=0.07497, over 4278562.53 frames. ], batch size: 507, lr: 4.21e-03, grad_scale: 32.0 2023-06-25 03:18:19,264 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.592e+02 3.443e+02 3.965e+02 5.571e+02 1.139e+03, threshold=7.931e+02, percent-clipped=16.0 2023-06-25 03:18:20,466 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.87 vs. limit=10.0 2023-06-25 03:18:28,893 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.64 vs. limit=12.0 2023-06-25 03:19:50,382 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=1244742.0, ans=0.09899494936611666 2023-06-25 03:20:03,674 INFO [train.py:996] (3/4) Epoch 7, batch 24500, loss[loss=0.1995, simple_loss=0.2739, pruned_loss=0.06251, over 21133.00 frames. ], tot_loss[loss=0.2264, simple_loss=0.3044, pruned_loss=0.07416, over 4277928.40 frames. ], batch size: 608, lr: 4.21e-03, grad_scale: 16.0 2023-06-25 03:20:08,454 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.34 vs. limit=15.0 2023-06-25 03:20:11,390 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=1244802.0, ans=0.0 2023-06-25 03:20:21,004 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=14.27 vs. limit=15.0 2023-06-25 03:21:31,784 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=1245042.0, ans=0.04949747468305833 2023-06-25 03:21:46,462 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=17.38 vs. limit=15.0 2023-06-25 03:21:48,773 INFO [train.py:996] (3/4) Epoch 7, batch 24550, loss[loss=0.2656, simple_loss=0.3367, pruned_loss=0.09724, over 21416.00 frames. ], tot_loss[loss=0.2277, simple_loss=0.3042, pruned_loss=0.07564, over 4271669.14 frames. 
], batch size: 159, lr: 4.21e-03, grad_scale: 16.0 2023-06-25 03:21:53,850 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.291e+02 2.970e+02 3.569e+02 4.682e+02 1.145e+03, threshold=7.139e+02, percent-clipped=2.0 2023-06-25 03:22:31,246 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.58 vs. limit=15.0 2023-06-25 03:22:49,846 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1245282.0, ans=0.0 2023-06-25 03:23:15,833 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=9.29 vs. limit=15.0 2023-06-25 03:23:19,039 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.33 vs. limit=15.0 2023-06-25 03:23:27,118 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=1245342.0, ans=0.125 2023-06-25 03:23:31,368 INFO [train.py:996] (3/4) Epoch 7, batch 24600, loss[loss=0.1823, simple_loss=0.2495, pruned_loss=0.05758, over 21355.00 frames. ], tot_loss[loss=0.2257, simple_loss=0.3, pruned_loss=0.0757, over 4268956.83 frames. ], batch size: 194, lr: 4.21e-03, grad_scale: 16.0 2023-06-25 03:23:50,803 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=1245402.0, ans=0.2 2023-06-25 03:24:02,330 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1245462.0, ans=0.0 2023-06-25 03:24:30,443 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1245522.0, ans=0.0 2023-06-25 03:24:51,591 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1245582.0, ans=0.125 2023-06-25 03:24:54,805 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=1245642.0, ans=0.0 2023-06-25 03:24:54,836 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1245642.0, ans=0.125 2023-06-25 03:25:00,655 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=1245642.0, ans=0.125 2023-06-25 03:25:14,507 INFO [train.py:996] (3/4) Epoch 7, batch 24650, loss[loss=0.2066, simple_loss=0.2784, pruned_loss=0.06734, over 21591.00 frames. ], tot_loss[loss=0.2204, simple_loss=0.2925, pruned_loss=0.07412, over 4264935.23 frames. 
], batch size: 414, lr: 4.21e-03, grad_scale: 16.0 2023-06-25 03:25:19,708 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.334e+02 3.258e+02 3.830e+02 5.672e+02 1.406e+03, threshold=7.660e+02, percent-clipped=13.0 2023-06-25 03:26:10,154 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1245822.0, ans=0.0 2023-06-25 03:26:36,535 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1245882.0, ans=0.1 2023-06-25 03:26:38,377 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1245882.0, ans=0.1 2023-06-25 03:26:49,536 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=8.04 vs. limit=15.0 2023-06-25 03:26:57,217 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1245942.0, ans=0.0 2023-06-25 03:27:02,269 INFO [train.py:996] (3/4) Epoch 7, batch 24700, loss[loss=0.1893, simple_loss=0.2588, pruned_loss=0.05985, over 21830.00 frames. ], tot_loss[loss=0.2173, simple_loss=0.2898, pruned_loss=0.07241, over 4261539.83 frames. ], batch size: 118, lr: 4.21e-03, grad_scale: 16.0 2023-06-25 03:28:01,990 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1246122.0, ans=0.125 2023-06-25 03:28:49,788 INFO [train.py:996] (3/4) Epoch 7, batch 24750, loss[loss=0.1755, simple_loss=0.2418, pruned_loss=0.05458, over 21607.00 frames. ], tot_loss[loss=0.212, simple_loss=0.2838, pruned_loss=0.07009, over 4264903.66 frames. ], batch size: 263, lr: 4.21e-03, grad_scale: 16.0 2023-06-25 03:28:54,681 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.316e+02 2.901e+02 3.279e+02 4.785e+02 1.213e+03, threshold=6.557e+02, percent-clipped=5.0 2023-06-25 03:30:04,788 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=1246482.0, ans=0.0 2023-06-25 03:30:35,880 INFO [train.py:996] (3/4) Epoch 7, batch 24800, loss[loss=0.2096, simple_loss=0.2774, pruned_loss=0.0709, over 21869.00 frames. ], tot_loss[loss=0.2092, simple_loss=0.2797, pruned_loss=0.06931, over 4267357.50 frames. ], batch size: 371, lr: 4.21e-03, grad_scale: 32.0 2023-06-25 03:30:49,609 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.97 vs. limit=15.0 2023-06-25 03:30:53,035 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.90 vs. 
limit=10.0 2023-06-25 03:31:08,093 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=1246662.0, ans=0.2 2023-06-25 03:31:43,685 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1246782.0, ans=0.1 2023-06-25 03:31:53,326 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=1246782.0, ans=0.125 2023-06-25 03:32:04,613 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1246842.0, ans=0.0 2023-06-25 03:32:15,485 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=1246842.0, ans=0.0 2023-06-25 03:32:23,868 INFO [train.py:996] (3/4) Epoch 7, batch 24850, loss[loss=0.235, simple_loss=0.3013, pruned_loss=0.08433, over 20114.00 frames. ], tot_loss[loss=0.2116, simple_loss=0.2812, pruned_loss=0.07094, over 4269838.86 frames. ], batch size: 703, lr: 4.21e-03, grad_scale: 16.0 2023-06-25 03:32:30,857 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.467e+02 3.124e+02 3.906e+02 4.909e+02 9.613e+02, threshold=7.812e+02, percent-clipped=9.0 2023-06-25 03:32:40,411 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1246962.0, ans=0.1 2023-06-25 03:32:54,522 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1246962.0, ans=0.125 2023-06-25 03:33:33,652 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1247082.0, ans=0.125 2023-06-25 03:34:14,035 INFO [train.py:996] (3/4) Epoch 7, batch 24900, loss[loss=0.2484, simple_loss=0.3252, pruned_loss=0.08585, over 21203.00 frames. ], tot_loss[loss=0.2132, simple_loss=0.2828, pruned_loss=0.07176, over 4265755.81 frames. ], batch size: 143, lr: 4.21e-03, grad_scale: 16.0 2023-06-25 03:34:26,943 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.17 vs. limit=15.0 2023-06-25 03:35:01,896 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=1247262.0, ans=0.05 2023-06-25 03:35:38,465 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.46 vs. limit=15.0 2023-06-25 03:36:08,373 INFO [train.py:996] (3/4) Epoch 7, batch 24950, loss[loss=0.2537, simple_loss=0.3262, pruned_loss=0.09057, over 21319.00 frames. ], tot_loss[loss=0.2205, simple_loss=0.2901, pruned_loss=0.07545, over 4266795.99 frames. ], batch size: 159, lr: 4.21e-03, grad_scale: 16.0 2023-06-25 03:36:09,594 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.82 vs. 
limit=12.0 2023-06-25 03:36:15,231 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.748e+02 3.765e+02 4.804e+02 6.774e+02 1.687e+03, threshold=9.608e+02, percent-clipped=17.0 2023-06-25 03:36:49,146 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1247562.0, ans=0.125 2023-06-25 03:36:52,895 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1247622.0, ans=0.125 2023-06-25 03:37:50,284 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.30 vs. limit=22.5 2023-06-25 03:37:55,324 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=1247742.0, ans=0.125 2023-06-25 03:37:57,849 INFO [train.py:996] (3/4) Epoch 7, batch 25000, loss[loss=0.1999, simple_loss=0.2706, pruned_loss=0.06463, over 21639.00 frames. ], tot_loss[loss=0.226, simple_loss=0.2979, pruned_loss=0.07707, over 4259205.18 frames. ], batch size: 282, lr: 4.21e-03, grad_scale: 16.0 2023-06-25 03:38:02,188 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1247802.0, ans=0.0 2023-06-25 03:38:30,096 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1247862.0, ans=0.0 2023-06-25 03:38:37,097 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=1247862.0, ans=0.5 2023-06-25 03:39:14,143 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=1247982.0, ans=0.125 2023-06-25 03:39:37,809 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=1248042.0, ans=0.035 2023-06-25 03:39:47,803 INFO [train.py:996] (3/4) Epoch 7, batch 25050, loss[loss=0.2155, simple_loss=0.273, pruned_loss=0.07901, over 21598.00 frames. ], tot_loss[loss=0.2219, simple_loss=0.292, pruned_loss=0.07584, over 4264094.37 frames. ], batch size: 415, lr: 4.21e-03, grad_scale: 16.0 2023-06-25 03:39:59,668 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.536e+02 3.278e+02 3.984e+02 5.261e+02 1.222e+03, threshold=7.967e+02, percent-clipped=1.0 2023-06-25 03:40:47,728 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=1248222.0, ans=0.125 2023-06-25 03:40:54,235 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1248282.0, ans=0.125 2023-06-25 03:40:56,672 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.91 vs. limit=22.5 2023-06-25 03:41:11,905 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1248342.0, ans=0.0 2023-06-25 03:41:35,611 INFO [train.py:996] (3/4) Epoch 7, batch 25100, loss[loss=0.1873, simple_loss=0.2554, pruned_loss=0.05959, over 21377.00 frames. ], tot_loss[loss=0.2168, simple_loss=0.2849, pruned_loss=0.07431, over 4265030.56 frames. 
], batch size: 211, lr: 4.21e-03, grad_scale: 16.0 2023-06-25 03:42:09,485 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=6.64 vs. limit=15.0 2023-06-25 03:42:38,154 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.73 vs. limit=22.5 2023-06-25 03:43:15,189 INFO [train.py:996] (3/4) Epoch 7, batch 25150, loss[loss=0.2141, simple_loss=0.2915, pruned_loss=0.06839, over 21748.00 frames. ], tot_loss[loss=0.218, simple_loss=0.2903, pruned_loss=0.07282, over 4267454.86 frames. ], batch size: 247, lr: 4.21e-03, grad_scale: 16.0 2023-06-25 03:43:22,414 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.352e+02 2.917e+02 3.507e+02 4.290e+02 7.134e+02, threshold=7.014e+02, percent-clipped=0.0 2023-06-25 03:44:07,419 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=1248822.0, ans=0.0 2023-06-25 03:44:24,985 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-25 03:44:34,631 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=9.81 vs. limit=15.0 2023-06-25 03:44:46,313 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer_ff2.min_abs, batch_count=1248942.0, ans=0.1 2023-06-25 03:45:03,178 INFO [train.py:996] (3/4) Epoch 7, batch 25200, loss[loss=0.186, simple_loss=0.2778, pruned_loss=0.04715, over 21242.00 frames. ], tot_loss[loss=0.2164, simple_loss=0.2914, pruned_loss=0.07067, over 4263265.98 frames. ], batch size: 176, lr: 4.21e-03, grad_scale: 32.0 2023-06-25 03:45:26,384 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1249062.0, ans=0.125 2023-06-25 03:46:05,663 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 03:46:34,528 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=1249242.0, ans=0.125 2023-06-25 03:46:44,513 INFO [train.py:996] (3/4) Epoch 7, batch 25250, loss[loss=0.2011, simple_loss=0.2674, pruned_loss=0.0674, over 21734.00 frames. ], tot_loss[loss=0.2143, simple_loss=0.2898, pruned_loss=0.06941, over 4263764.56 frames. ], batch size: 334, lr: 4.20e-03, grad_scale: 32.0 2023-06-25 03:46:50,725 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.310e+02 3.493e+02 4.531e+02 6.299e+02 1.264e+03, threshold=9.062e+02, percent-clipped=19.0 2023-06-25 03:47:02,048 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=1249362.0, ans=0.2 2023-06-25 03:48:11,773 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=1249542.0, ans=0.125 2023-06-25 03:48:32,318 INFO [train.py:996] (3/4) Epoch 7, batch 25300, loss[loss=0.2799, simple_loss=0.3491, pruned_loss=0.1054, over 21434.00 frames. ], tot_loss[loss=0.2137, simple_loss=0.2883, pruned_loss=0.0695, over 4261387.09 frames. ], batch size: 471, lr: 4.20e-03, grad_scale: 32.0 2023-06-25 03:49:01,081 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.02 vs. 
limit=15.0 2023-06-25 03:49:53,198 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=1249782.0, ans=0.2 2023-06-25 03:50:13,987 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=1249842.0, ans=0.125 2023-06-25 03:50:20,519 INFO [train.py:996] (3/4) Epoch 7, batch 25350, loss[loss=0.1977, simple_loss=0.2481, pruned_loss=0.07367, over 20279.00 frames. ], tot_loss[loss=0.2131, simple_loss=0.2884, pruned_loss=0.06885, over 4246630.51 frames. ], batch size: 703, lr: 4.20e-03, grad_scale: 16.0 2023-06-25 03:50:29,457 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.308e+02 2.853e+02 3.365e+02 4.532e+02 7.857e+02, threshold=6.730e+02, percent-clipped=0.0 2023-06-25 03:51:04,395 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=1250022.0, ans=0.125 2023-06-25 03:51:21,385 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=1250082.0, ans=0.05 2023-06-25 03:51:21,435 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=1250082.0, ans=0.125 2023-06-25 03:52:03,092 INFO [train.py:996] (3/4) Epoch 7, batch 25400, loss[loss=0.2119, simple_loss=0.2848, pruned_loss=0.06952, over 21705.00 frames. ], tot_loss[loss=0.2103, simple_loss=0.2845, pruned_loss=0.06802, over 4239521.82 frames. ], batch size: 282, lr: 4.20e-03, grad_scale: 16.0 2023-06-25 03:52:10,660 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1250202.0, ans=0.125 2023-06-25 03:52:22,664 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1250262.0, ans=0.0 2023-06-25 03:52:38,350 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1250262.0, ans=0.1 2023-06-25 03:53:26,135 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1250442.0, ans=0.125 2023-06-25 03:53:46,473 INFO [train.py:996] (3/4) Epoch 7, batch 25450, loss[loss=0.2367, simple_loss=0.3085, pruned_loss=0.08243, over 21733.00 frames. ], tot_loss[loss=0.211, simple_loss=0.2835, pruned_loss=0.06928, over 4236133.21 frames. ], batch size: 389, lr: 4.20e-03, grad_scale: 16.0 2023-06-25 03:53:55,093 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.119e+02 2.979e+02 3.775e+02 5.252e+02 7.977e+02, threshold=7.549e+02, percent-clipped=6.0 2023-06-25 03:54:32,225 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 03:55:14,632 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=1250742.0, ans=0.125 2023-06-25 03:55:21,635 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1250742.0, ans=0.0 2023-06-25 03:55:32,118 INFO [train.py:996] (3/4) Epoch 7, batch 25500, loss[loss=0.2384, simple_loss=0.3217, pruned_loss=0.07753, over 21615.00 frames. ], tot_loss[loss=0.2096, simple_loss=0.2856, pruned_loss=0.06683, over 4250619.67 frames. 
], batch size: 389, lr: 4.20e-03, grad_scale: 16.0 2023-06-25 03:55:53,801 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1250802.0, ans=0.125 2023-06-25 03:55:53,882 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1250802.0, ans=0.125 2023-06-25 03:56:31,344 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1250922.0, ans=0.125 2023-06-25 03:56:42,735 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=1250982.0, ans=0.07 2023-06-25 03:57:03,305 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1251042.0, ans=0.1 2023-06-25 03:57:13,711 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=1251042.0, ans=0.0 2023-06-25 03:57:13,736 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer_ff2.min_abs, batch_count=1251042.0, ans=0.1 2023-06-25 03:57:27,515 INFO [train.py:996] (3/4) Epoch 7, batch 25550, loss[loss=0.2053, simple_loss=0.3096, pruned_loss=0.0505, over 21870.00 frames. ], tot_loss[loss=0.2132, simple_loss=0.2922, pruned_loss=0.06714, over 4259823.21 frames. ], batch size: 316, lr: 4.20e-03, grad_scale: 16.0 2023-06-25 03:57:41,638 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.140e+02 3.132e+02 4.314e+02 5.832e+02 9.037e+02, threshold=8.627e+02, percent-clipped=4.0 2023-06-25 03:57:42,343 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=1251102.0, ans=0.2 2023-06-25 03:57:44,188 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=1251102.0, ans=0.04949747468305833 2023-06-25 03:58:13,702 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1251222.0, ans=0.0 2023-06-25 03:58:39,106 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1251282.0, ans=0.1 2023-06-25 03:59:19,514 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1251342.0, ans=0.125 2023-06-25 03:59:21,922 INFO [train.py:996] (3/4) Epoch 7, batch 25600, loss[loss=0.2511, simple_loss=0.3283, pruned_loss=0.08697, over 21615.00 frames. ], tot_loss[loss=0.2167, simple_loss=0.2965, pruned_loss=0.06842, over 4261635.22 frames. ], batch size: 389, lr: 4.20e-03, grad_scale: 32.0 2023-06-25 04:01:09,209 INFO [train.py:996] (3/4) Epoch 7, batch 25650, loss[loss=0.202, simple_loss=0.2703, pruned_loss=0.06686, over 21605.00 frames. ], tot_loss[loss=0.2192, simple_loss=0.2972, pruned_loss=0.07057, over 4253083.71 frames. 
], batch size: 298, lr: 4.20e-03, grad_scale: 16.0 2023-06-25 04:01:19,262 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.352e+02 3.050e+02 3.577e+02 4.545e+02 8.924e+02, threshold=7.154e+02, percent-clipped=2.0 2023-06-25 04:01:25,125 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=1251762.0, ans=0.2 2023-06-25 04:01:45,673 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1251822.0, ans=0.0 2023-06-25 04:02:54,039 INFO [train.py:996] (3/4) Epoch 7, batch 25700, loss[loss=0.2755, simple_loss=0.34, pruned_loss=0.1055, over 21457.00 frames. ], tot_loss[loss=0.219, simple_loss=0.2949, pruned_loss=0.07156, over 4257667.05 frames. ], batch size: 471, lr: 4.20e-03, grad_scale: 16.0 2023-06-25 04:02:56,146 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=1252002.0, ans=0.125 2023-06-25 04:03:04,970 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=1252002.0, ans=0.2 2023-06-25 04:03:23,382 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.13 vs. limit=6.0 2023-06-25 04:03:51,824 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1252122.0, ans=0.125 2023-06-25 04:04:35,226 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.21 vs. limit=10.0 2023-06-25 04:04:43,960 INFO [train.py:996] (3/4) Epoch 7, batch 25750, loss[loss=0.25, simple_loss=0.3238, pruned_loss=0.08806, over 21588.00 frames. ], tot_loss[loss=0.224, simple_loss=0.2999, pruned_loss=0.07411, over 4262607.34 frames. ], batch size: 389, lr: 4.20e-03, grad_scale: 16.0 2023-06-25 04:04:45,518 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=8.03 vs. limit=15.0 2023-06-25 04:04:55,423 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.453e+02 3.207e+02 3.828e+02 5.534e+02 9.207e+02, threshold=7.655e+02, percent-clipped=4.0 2023-06-25 04:06:24,779 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.60 vs. limit=22.5 2023-06-25 04:06:41,406 INFO [train.py:996] (3/4) Epoch 7, batch 25800, loss[loss=0.2378, simple_loss=0.3172, pruned_loss=0.0792, over 21715.00 frames. ], tot_loss[loss=0.235, simple_loss=0.3132, pruned_loss=0.07838, over 4264673.22 frames. ], batch size: 332, lr: 4.20e-03, grad_scale: 16.0 2023-06-25 04:07:31,651 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.19 vs. limit=15.0 2023-06-25 04:07:54,270 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1252782.0, ans=0.125 2023-06-25 04:08:01,889 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=5.24 vs. limit=12.0 2023-06-25 04:08:06,860 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=8.64 vs. 
limit=15.0 2023-06-25 04:08:08,159 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1252842.0, ans=0.125 2023-06-25 04:08:36,041 INFO [train.py:996] (3/4) Epoch 7, batch 25850, loss[loss=0.2213, simple_loss=0.2905, pruned_loss=0.07607, over 21732.00 frames. ], tot_loss[loss=0.2348, simple_loss=0.3141, pruned_loss=0.07779, over 4265120.42 frames. ], batch size: 230, lr: 4.20e-03, grad_scale: 16.0 2023-06-25 04:08:36,605 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 04:08:46,063 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.468e+02 3.799e+02 4.980e+02 7.138e+02 1.041e+03, threshold=9.960e+02, percent-clipped=14.0 2023-06-25 04:09:07,485 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=1252962.0, ans=0.2 2023-06-25 04:09:38,128 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=10.08 vs. limit=15.0 2023-06-25 04:09:50,986 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.53 vs. limit=15.0 2023-06-25 04:10:12,879 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=1253142.0, ans=0.125 2023-06-25 04:10:24,678 INFO [train.py:996] (3/4) Epoch 7, batch 25900, loss[loss=0.3143, simple_loss=0.4001, pruned_loss=0.1142, over 21708.00 frames. ], tot_loss[loss=0.2365, simple_loss=0.3153, pruned_loss=0.07888, over 4274945.11 frames. ], batch size: 414, lr: 4.20e-03, grad_scale: 16.0 2023-06-25 04:10:37,100 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=1253202.0, ans=0.2 2023-06-25 04:10:59,822 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=1253262.0, ans=10.0 2023-06-25 04:11:20,292 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=8.38 vs. limit=15.0 2023-06-25 04:11:52,821 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.89 vs. limit=10.0 2023-06-25 04:12:11,639 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1253442.0, ans=0.1 2023-06-25 04:12:19,420 INFO [train.py:996] (3/4) Epoch 7, batch 25950, loss[loss=0.2084, simple_loss=0.2882, pruned_loss=0.06431, over 21873.00 frames. ], tot_loss[loss=0.2428, simple_loss=0.3217, pruned_loss=0.08199, over 4281791.08 frames. 
], batch size: 107, lr: 4.20e-03, grad_scale: 16.0 2023-06-25 04:12:21,598 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=1253502.0, ans=0.0 2023-06-25 04:12:27,268 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1253502.0, ans=0.125 2023-06-25 04:12:30,257 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.260e+02 3.924e+02 4.825e+02 6.667e+02 9.345e+02, threshold=9.651e+02, percent-clipped=0.0 2023-06-25 04:12:44,724 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=1253562.0, ans=0.0 2023-06-25 04:13:03,863 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1253622.0, ans=0.0 2023-06-25 04:13:52,884 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1253742.0, ans=0.125 2023-06-25 04:14:08,569 INFO [train.py:996] (3/4) Epoch 7, batch 26000, loss[loss=0.2472, simple_loss=0.3279, pruned_loss=0.08323, over 21353.00 frames. ], tot_loss[loss=0.2403, simple_loss=0.32, pruned_loss=0.08031, over 4274932.09 frames. ], batch size: 176, lr: 4.20e-03, grad_scale: 32.0 2023-06-25 04:14:17,156 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.32 vs. limit=6.0 2023-06-25 04:14:34,663 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=1253862.0, ans=0.0 2023-06-25 04:15:13,511 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=1253982.0, ans=0.125 2023-06-25 04:15:58,147 INFO [train.py:996] (3/4) Epoch 7, batch 26050, loss[loss=0.2165, simple_loss=0.2864, pruned_loss=0.07331, over 21482.00 frames. ], tot_loss[loss=0.2409, simple_loss=0.3195, pruned_loss=0.08113, over 4273580.79 frames. ], batch size: 211, lr: 4.20e-03, grad_scale: 16.0 2023-06-25 04:16:10,015 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.358e+02 3.188e+02 3.821e+02 5.430e+02 8.574e+02, threshold=7.643e+02, percent-clipped=0.0 2023-06-25 04:16:44,021 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1254222.0, ans=0.125 2023-06-25 04:17:02,030 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=1254282.0, ans=0.025 2023-06-25 04:17:02,035 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1254282.0, ans=0.1 2023-06-25 04:17:12,024 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-25 04:17:45,906 INFO [train.py:996] (3/4) Epoch 7, batch 26100, loss[loss=0.2131, simple_loss=0.2813, pruned_loss=0.07242, over 21686.00 frames. ], tot_loss[loss=0.2368, simple_loss=0.314, pruned_loss=0.07974, over 4280605.23 frames. 
], batch size: 263, lr: 4.20e-03, grad_scale: 16.0 2023-06-25 04:18:00,395 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1254402.0, ans=0.1 2023-06-25 04:19:21,566 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1254642.0, ans=0.0 2023-06-25 04:19:34,018 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=1254702.0, ans=0.125 2023-06-25 04:19:35,051 INFO [train.py:996] (3/4) Epoch 7, batch 26150, loss[loss=0.2109, simple_loss=0.2869, pruned_loss=0.06741, over 21800.00 frames. ], tot_loss[loss=0.2345, simple_loss=0.3101, pruned_loss=0.07947, over 4285236.87 frames. ], batch size: 247, lr: 4.20e-03, grad_scale: 16.0 2023-06-25 04:19:47,517 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.460e+02 3.240e+02 3.858e+02 5.306e+02 8.605e+02, threshold=7.716e+02, percent-clipped=2.0 2023-06-25 04:20:02,092 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=1254762.0, ans=0.125 2023-06-25 04:20:28,781 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.62 vs. limit=22.5 2023-06-25 04:21:05,208 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1254942.0, ans=0.125 2023-06-25 04:21:14,421 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1254942.0, ans=0.1 2023-06-25 04:21:24,107 INFO [train.py:996] (3/4) Epoch 7, batch 26200, loss[loss=0.2072, simple_loss=0.299, pruned_loss=0.05766, over 21753.00 frames. ], tot_loss[loss=0.2328, simple_loss=0.3108, pruned_loss=0.07737, over 4285103.99 frames. ], batch size: 124, lr: 4.20e-03, grad_scale: 16.0 2023-06-25 04:21:35,759 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1255002.0, ans=0.125 2023-06-25 04:21:43,363 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=8.53 vs. limit=15.0 2023-06-25 04:22:08,772 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1255122.0, ans=0.0 2023-06-25 04:22:46,551 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=8.30 vs. limit=15.0 2023-06-25 04:22:47,645 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1255182.0, ans=0.1 2023-06-25 04:23:11,941 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1255302.0, ans=0.125 2023-06-25 04:23:13,407 INFO [train.py:996] (3/4) Epoch 7, batch 26250, loss[loss=0.2147, simple_loss=0.292, pruned_loss=0.0687, over 21684.00 frames. ], tot_loss[loss=0.2334, simple_loss=0.3139, pruned_loss=0.07649, over 4281409.44 frames. 
], batch size: 263, lr: 4.19e-03, grad_scale: 16.0 2023-06-25 04:23:25,316 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.169e+02 3.172e+02 3.762e+02 4.925e+02 1.309e+03, threshold=7.524e+02, percent-clipped=5.0 2023-06-25 04:24:18,943 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1255482.0, ans=0.1 2023-06-25 04:24:20,954 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=1255482.0, ans=0.2 2023-06-25 04:24:45,504 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1255542.0, ans=0.1 2023-06-25 04:25:01,095 INFO [train.py:996] (3/4) Epoch 7, batch 26300, loss[loss=0.2259, simple_loss=0.2997, pruned_loss=0.07599, over 21841.00 frames. ], tot_loss[loss=0.2331, simple_loss=0.3119, pruned_loss=0.07714, over 4285299.00 frames. ], batch size: 124, lr: 4.19e-03, grad_scale: 16.0 2023-06-25 04:25:13,824 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1255602.0, ans=0.125 2023-06-25 04:25:37,850 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.68 vs. limit=15.0 2023-06-25 04:25:52,858 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1255722.0, ans=0.125 2023-06-25 04:26:15,704 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=1255782.0, ans=0.125 2023-06-25 04:26:25,681 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=1255782.0, ans=0.125 2023-06-25 04:26:53,857 INFO [train.py:996] (3/4) Epoch 7, batch 26350, loss[loss=0.2728, simple_loss=0.3379, pruned_loss=0.1039, over 21599.00 frames. ], tot_loss[loss=0.2333, simple_loss=0.3104, pruned_loss=0.0781, over 4285014.87 frames. ], batch size: 415, lr: 4.19e-03, grad_scale: 16.0 2023-06-25 04:27:11,534 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.447e+02 3.110e+02 3.681e+02 4.505e+02 7.991e+02, threshold=7.361e+02, percent-clipped=2.0 2023-06-25 04:27:17,447 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1255962.0, ans=0.125 2023-06-25 04:27:51,686 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.40 vs. limit=10.0 2023-06-25 04:28:08,240 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1256082.0, ans=0.1 2023-06-25 04:28:40,440 INFO [train.py:996] (3/4) Epoch 7, batch 26400, loss[loss=0.21, simple_loss=0.2708, pruned_loss=0.07461, over 21836.00 frames. ], tot_loss[loss=0.2306, simple_loss=0.3047, pruned_loss=0.07826, over 4283028.44 frames. ], batch size: 372, lr: 4.19e-03, grad_scale: 32.0 2023-06-25 04:29:36,151 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1256322.0, ans=0.0 2023-06-25 04:29:42,826 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=4.50 vs. 
limit=15.0 2023-06-25 04:30:01,004 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1256382.0, ans=0.125 2023-06-25 04:30:15,032 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=6.30 vs. limit=15.0 2023-06-25 04:30:16,497 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1256442.0, ans=0.1 2023-06-25 04:30:28,876 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=1256442.0, ans=0.0 2023-06-25 04:30:39,801 INFO [train.py:996] (3/4) Epoch 7, batch 26450, loss[loss=0.2508, simple_loss=0.352, pruned_loss=0.07483, over 21729.00 frames. ], tot_loss[loss=0.2311, simple_loss=0.3062, pruned_loss=0.07795, over 4278463.72 frames. ], batch size: 332, lr: 4.19e-03, grad_scale: 32.0 2023-06-25 04:30:40,792 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.04 vs. limit=22.5 2023-06-25 04:30:57,250 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.311e+02 3.534e+02 4.471e+02 5.534e+02 1.801e+03, threshold=8.941e+02, percent-clipped=10.0 2023-06-25 04:31:01,438 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=1256562.0, ans=0.0 2023-06-25 04:32:24,758 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=1256742.0, ans=0.0 2023-06-25 04:32:36,246 INFO [train.py:996] (3/4) Epoch 7, batch 26500, loss[loss=0.2339, simple_loss=0.3257, pruned_loss=0.07103, over 21680.00 frames. ], tot_loss[loss=0.2303, simple_loss=0.3075, pruned_loss=0.07655, over 4270742.06 frames. ], batch size: 389, lr: 4.19e-03, grad_scale: 32.0 2023-06-25 04:33:06,349 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.23 vs. limit=12.0 2023-06-25 04:33:29,374 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=1256922.0, ans=0.0 2023-06-25 04:33:31,578 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1256922.0, ans=0.0 2023-06-25 04:33:36,475 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1256922.0, ans=0.125 2023-06-25 04:34:27,031 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=1257042.0, ans=0.0 2023-06-25 04:34:33,093 INFO [train.py:996] (3/4) Epoch 7, batch 26550, loss[loss=0.195, simple_loss=0.3075, pruned_loss=0.04128, over 20716.00 frames. ], tot_loss[loss=0.2248, simple_loss=0.3023, pruned_loss=0.07362, over 4256206.72 frames. 
], batch size: 608, lr: 4.19e-03, grad_scale: 16.0 2023-06-25 04:34:33,950 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=1257102.0, ans=0.2 2023-06-25 04:34:47,488 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.196e+02 3.332e+02 4.391e+02 7.235e+02 1.419e+03, threshold=8.782e+02, percent-clipped=20.0 2023-06-25 04:35:29,349 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.45 vs. limit=6.0 2023-06-25 04:35:51,044 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=1257282.0, ans=0.0 2023-06-25 04:36:21,169 INFO [train.py:996] (3/4) Epoch 7, batch 26600, loss[loss=0.2006, simple_loss=0.2763, pruned_loss=0.06246, over 21729.00 frames. ], tot_loss[loss=0.2223, simple_loss=0.303, pruned_loss=0.07085, over 4263467.70 frames. ], batch size: 112, lr: 4.19e-03, grad_scale: 16.0 2023-06-25 04:36:22,118 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1257402.0, ans=0.125 2023-06-25 04:37:16,615 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1257522.0, ans=0.125 2023-06-25 04:37:31,573 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=1257582.0, ans=0.125 2023-06-25 04:37:37,565 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.00 vs. limit=15.0 2023-06-25 04:37:58,501 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1257642.0, ans=0.0 2023-06-25 04:38:10,076 INFO [train.py:996] (3/4) Epoch 7, batch 26650, loss[loss=0.1543, simple_loss=0.2387, pruned_loss=0.03497, over 21581.00 frames. ], tot_loss[loss=0.2181, simple_loss=0.2961, pruned_loss=0.07008, over 4262802.78 frames. ], batch size: 230, lr: 4.19e-03, grad_scale: 16.0 2023-06-25 04:38:22,635 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=1257702.0, ans=0.0 2023-06-25 04:38:28,645 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.814e+02 2.895e+02 3.400e+02 5.153e+02 1.068e+03, threshold=6.799e+02, percent-clipped=4.0 2023-06-25 04:39:57,579 INFO [train.py:996] (3/4) Epoch 7, batch 26700, loss[loss=0.1934, simple_loss=0.2665, pruned_loss=0.06011, over 21813.00 frames. ], tot_loss[loss=0.2116, simple_loss=0.289, pruned_loss=0.06705, over 4266795.13 frames. 
], batch size: 282, lr: 4.19e-03, grad_scale: 16.0 2023-06-25 04:40:17,613 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-25 04:40:21,111 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1258062.0, ans=0.125 2023-06-25 04:40:26,203 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1258062.0, ans=0.0 2023-06-25 04:40:35,560 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1258062.0, ans=0.0 2023-06-25 04:40:50,115 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1258122.0, ans=0.0 2023-06-25 04:41:52,589 INFO [train.py:996] (3/4) Epoch 7, batch 26750, loss[loss=0.2897, simple_loss=0.3497, pruned_loss=0.1148, over 21383.00 frames. ], tot_loss[loss=0.2115, simple_loss=0.29, pruned_loss=0.06655, over 4275505.79 frames. ], batch size: 507, lr: 4.19e-03, grad_scale: 16.0 2023-06-25 04:41:53,381 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1258302.0, ans=0.125 2023-06-25 04:41:58,445 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=1258302.0, ans=0.125 2023-06-25 04:42:06,351 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.047e+02 2.716e+02 3.514e+02 4.569e+02 1.217e+03, threshold=7.028e+02, percent-clipped=8.0 2023-06-25 04:42:28,020 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.04 vs. limit=10.0 2023-06-25 04:43:00,838 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1258482.0, ans=0.1 2023-06-25 04:43:07,717 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=1258482.0, ans=0.5 2023-06-25 04:43:07,793 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1258482.0, ans=0.1 2023-06-25 04:43:23,831 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1258542.0, ans=0.0 2023-06-25 04:43:33,804 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.93 vs. limit=6.0 2023-06-25 04:43:37,371 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.44 vs. limit=22.5 2023-06-25 04:43:43,446 INFO [train.py:996] (3/4) Epoch 7, batch 26800, loss[loss=0.2203, simple_loss=0.2941, pruned_loss=0.07324, over 21860.00 frames. ], tot_loss[loss=0.22, simple_loss=0.298, pruned_loss=0.07095, over 4273065.43 frames. 
], batch size: 282, lr: 4.19e-03, grad_scale: 32.0 2023-06-25 04:44:53,370 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=1258722.0, ans=0.2 2023-06-25 04:44:58,197 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1258782.0, ans=0.125 2023-06-25 04:45:26,689 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=7.57 vs. limit=10.0 2023-06-25 04:45:32,693 INFO [train.py:996] (3/4) Epoch 7, batch 26850, loss[loss=0.208, simple_loss=0.2708, pruned_loss=0.07261, over 21823.00 frames. ], tot_loss[loss=0.2217, simple_loss=0.2984, pruned_loss=0.07257, over 4273221.54 frames. ], batch size: 107, lr: 4.19e-03, grad_scale: 32.0 2023-06-25 04:45:58,734 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.727e+02 3.580e+02 4.511e+02 5.580e+02 1.314e+03, threshold=9.022e+02, percent-clipped=13.0 2023-06-25 04:46:12,889 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=1258962.0, ans=0.2 2023-06-25 04:46:23,525 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.40 vs. limit=6.0 2023-06-25 04:46:49,490 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1259082.0, ans=0.0 2023-06-25 04:47:19,187 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=1259142.0, ans=0.025 2023-06-25 04:47:22,388 INFO [train.py:996] (3/4) Epoch 7, batch 26900, loss[loss=0.1975, simple_loss=0.2546, pruned_loss=0.07023, over 21481.00 frames. ], tot_loss[loss=0.217, simple_loss=0.2897, pruned_loss=0.07212, over 4277711.00 frames. ], batch size: 195, lr: 4.19e-03, grad_scale: 16.0 2023-06-25 04:47:58,168 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1259262.0, ans=0.0 2023-06-25 04:48:42,081 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1259382.0, ans=0.0 2023-06-25 04:48:53,518 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1259442.0, ans=0.0 2023-06-25 04:49:06,709 INFO [train.py:996] (3/4) Epoch 7, batch 26950, loss[loss=0.2674, simple_loss=0.3552, pruned_loss=0.08977, over 21644.00 frames. ], tot_loss[loss=0.2169, simple_loss=0.2896, pruned_loss=0.07205, over 4280029.12 frames. ], batch size: 414, lr: 4.19e-03, grad_scale: 16.0 2023-06-25 04:49:33,605 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.559e+02 3.020e+02 3.484e+02 4.294e+02 8.554e+02, threshold=6.967e+02, percent-clipped=0.0 2023-06-25 04:50:02,442 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=1259622.0, ans=0.0 2023-06-25 04:50:18,170 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1259622.0, ans=0.0 2023-06-25 04:50:38,307 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=8.83 vs. 
limit=15.0 2023-06-25 04:50:48,079 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=3.51 vs. limit=12.0 2023-06-25 04:50:49,538 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.17 vs. limit=22.5 2023-06-25 04:51:02,039 INFO [train.py:996] (3/4) Epoch 7, batch 27000, loss[loss=0.2253, simple_loss=0.32, pruned_loss=0.06525, over 21638.00 frames. ], tot_loss[loss=0.2144, simple_loss=0.2893, pruned_loss=0.06982, over 4280618.94 frames. ], batch size: 442, lr: 4.19e-03, grad_scale: 16.0 2023-06-25 04:51:02,040 INFO [train.py:1019] (3/4) Computing validation loss 2023-06-25 04:51:24,282 INFO [train.py:1028] (3/4) Epoch 7, validation: loss=0.2512, simple_loss=0.3463, pruned_loss=0.07806, over 1796401.00 frames. 2023-06-25 04:51:24,283 INFO [train.py:1029] (3/4) Maximum memory allocated so far is 23654MB 2023-06-25 04:51:52,673 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1259862.0, ans=0.0 2023-06-25 04:52:11,977 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=1259922.0, ans=0.125 2023-06-25 04:52:27,542 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=1259982.0, ans=0.05 2023-06-25 04:52:31,377 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=2.56 vs. limit=15.0 2023-06-25 04:52:36,278 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1259982.0, ans=0.0 2023-06-25 04:53:03,937 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.23 vs. limit=15.0 2023-06-25 04:53:06,865 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1260042.0, ans=0.1 2023-06-25 04:53:14,903 INFO [train.py:996] (3/4) Epoch 7, batch 27050, loss[loss=0.253, simple_loss=0.3223, pruned_loss=0.09183, over 21575.00 frames. ], tot_loss[loss=0.2119, simple_loss=0.2909, pruned_loss=0.0664, over 4280389.68 frames. ], batch size: 471, lr: 4.19e-03, grad_scale: 16.0 2023-06-25 04:53:34,728 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.941e+02 2.897e+02 3.762e+02 4.771e+02 8.226e+02, threshold=7.524e+02, percent-clipped=2.0 2023-06-25 04:54:07,320 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=1260222.0, ans=0.2 2023-06-25 04:54:15,159 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.14 vs. limit=15.0 2023-06-25 04:54:55,757 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer_ff3.min_abs, batch_count=1260342.0, ans=0.2 2023-06-25 04:55:04,181 INFO [train.py:996] (3/4) Epoch 7, batch 27100, loss[loss=0.2075, simple_loss=0.2766, pruned_loss=0.06921, over 21683.00 frames. ], tot_loss[loss=0.2141, simple_loss=0.293, pruned_loss=0.06756, over 4288112.50 frames. 
], batch size: 263, lr: 4.19e-03, grad_scale: 16.0 2023-06-25 04:55:31,233 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1260462.0, ans=0.0 2023-06-25 04:55:32,896 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=1260462.0, ans=0.125 2023-06-25 04:55:43,898 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1260462.0, ans=0.0 2023-06-25 04:56:18,936 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1260582.0, ans=0.125 2023-06-25 04:56:25,780 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1260582.0, ans=0.0 2023-06-25 04:56:53,966 INFO [train.py:996] (3/4) Epoch 7, batch 27150, loss[loss=0.253, simple_loss=0.3459, pruned_loss=0.08006, over 21813.00 frames. ], tot_loss[loss=0.2217, simple_loss=0.3028, pruned_loss=0.07033, over 4279324.08 frames. ], batch size: 282, lr: 4.19e-03, grad_scale: 16.0 2023-06-25 04:57:16,951 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=1260702.0, ans=0.125 2023-06-25 04:57:19,848 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.408e+02 3.400e+02 4.098e+02 5.830e+02 1.178e+03, threshold=8.196e+02, percent-clipped=9.0 2023-06-25 04:57:26,065 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1260762.0, ans=0.125 2023-06-25 04:57:33,699 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.72 vs. limit=12.0 2023-06-25 04:57:43,529 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=1260822.0, ans=0.0 2023-06-25 04:58:29,349 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=1260942.0, ans=0.0 2023-06-25 04:58:53,807 INFO [train.py:996] (3/4) Epoch 7, batch 27200, loss[loss=0.2933, simple_loss=0.3633, pruned_loss=0.1116, over 21461.00 frames. ], tot_loss[loss=0.2302, simple_loss=0.3128, pruned_loss=0.07384, over 4279237.05 frames. ], batch size: 471, lr: 4.19e-03, grad_scale: 32.0 2023-06-25 04:58:58,303 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=1261002.0, ans=0.025 2023-06-25 04:59:25,157 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=1261062.0, ans=0.2 2023-06-25 05:00:30,519 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=1261242.0, ans=0.0 2023-06-25 05:00:41,151 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=1261242.0, ans=0.0 2023-06-25 05:00:44,389 INFO [train.py:996] (3/4) Epoch 7, batch 27250, loss[loss=0.2242, simple_loss=0.2969, pruned_loss=0.07577, over 20649.00 frames. ], tot_loss[loss=0.2367, simple_loss=0.3165, pruned_loss=0.07846, over 4276977.49 frames. 
], batch size: 607, lr: 4.18e-03, grad_scale: 16.0 2023-06-25 05:01:00,037 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys.whitening_limit, batch_count=1261302.0, ans=6.0 2023-06-25 05:01:02,613 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.541e+02 3.239e+02 3.756e+02 4.583e+02 7.251e+02, threshold=7.513e+02, percent-clipped=0.0 2023-06-25 05:02:24,202 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=1261542.0, ans=0.125 2023-06-25 05:02:25,005 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=7.83 vs. limit=15.0 2023-06-25 05:02:34,936 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=1261602.0, ans=0.2 2023-06-25 05:02:36,165 INFO [train.py:996] (3/4) Epoch 7, batch 27300, loss[loss=0.2259, simple_loss=0.2866, pruned_loss=0.08258, over 20002.00 frames. ], tot_loss[loss=0.2387, simple_loss=0.3178, pruned_loss=0.0798, over 4276560.40 frames. ], batch size: 702, lr: 4.18e-03, grad_scale: 16.0 2023-06-25 05:02:44,935 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1261602.0, ans=0.125 2023-06-25 05:02:53,963 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=1261602.0, ans=0.2 2023-06-25 05:02:55,777 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=1261602.0, ans=0.0 2023-06-25 05:04:23,816 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1261842.0, ans=0.125 2023-06-25 05:04:26,393 INFO [train.py:996] (3/4) Epoch 7, batch 27350, loss[loss=0.2342, simple_loss=0.3127, pruned_loss=0.07787, over 21289.00 frames. ], tot_loss[loss=0.2399, simple_loss=0.3201, pruned_loss=0.07987, over 4276760.24 frames. ], batch size: 159, lr: 4.18e-03, grad_scale: 16.0 2023-06-25 05:04:27,748 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.18 vs. limit=15.0 2023-06-25 05:04:42,324 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-25 05:04:48,249 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.380e+02 3.470e+02 4.790e+02 5.992e+02 9.415e+02, threshold=9.580e+02, percent-clipped=9.0 2023-06-25 05:05:36,848 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.84 vs. limit=22.5 2023-06-25 05:06:18,600 INFO [train.py:996] (3/4) Epoch 7, batch 27400, loss[loss=0.1975, simple_loss=0.2673, pruned_loss=0.06384, over 21683.00 frames. ], tot_loss[loss=0.2367, simple_loss=0.315, pruned_loss=0.07916, over 4281053.50 frames. ], batch size: 282, lr: 4.18e-03, grad_scale: 16.0 2023-06-25 05:07:53,950 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1262442.0, ans=0.0 2023-06-25 05:08:08,411 INFO [train.py:996] (3/4) Epoch 7, batch 27450, loss[loss=0.2504, simple_loss=0.3296, pruned_loss=0.08563, over 21400.00 frames. ], tot_loss[loss=0.2317, simple_loss=0.3086, pruned_loss=0.07735, over 4277952.96 frames. 
], batch size: 194, lr: 4.18e-03, grad_scale: 16.0 2023-06-25 05:08:36,542 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.458e+02 3.140e+02 3.820e+02 5.353e+02 9.307e+02, threshold=7.640e+02, percent-clipped=0.0 2023-06-25 05:08:47,653 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.01 vs. limit=15.0 2023-06-25 05:09:29,830 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=1262682.0, ans=0.1 2023-06-25 05:09:50,491 INFO [train.py:996] (3/4) Epoch 7, batch 27500, loss[loss=0.2231, simple_loss=0.2945, pruned_loss=0.07581, over 21903.00 frames. ], tot_loss[loss=0.2307, simple_loss=0.3065, pruned_loss=0.07741, over 4283723.65 frames. ], batch size: 316, lr: 4.18e-03, grad_scale: 16.0 2023-06-25 05:09:52,727 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=1262802.0, ans=0.05 2023-06-25 05:10:03,368 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.49 vs. limit=15.0 2023-06-25 05:10:15,118 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=1262802.0, ans=0.125 2023-06-25 05:10:16,638 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1262862.0, ans=0.125 2023-06-25 05:10:50,235 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1262922.0, ans=0.1 2023-06-25 05:11:16,414 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1262982.0, ans=0.1 2023-06-25 05:11:43,527 INFO [train.py:996] (3/4) Epoch 7, batch 27550, loss[loss=0.2125, simple_loss=0.2734, pruned_loss=0.07581, over 21221.00 frames. ], tot_loss[loss=0.2249, simple_loss=0.301, pruned_loss=0.07438, over 4277946.62 frames. ], batch size: 176, lr: 4.18e-03, grad_scale: 16.0 2023-06-25 05:12:10,961 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.562e+02 3.311e+02 4.001e+02 4.826e+02 1.149e+03, threshold=8.002e+02, percent-clipped=4.0 2023-06-25 05:13:22,514 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=6.51 vs. limit=15.0 2023-06-25 05:13:29,710 INFO [train.py:996] (3/4) Epoch 7, batch 27600, loss[loss=0.2003, simple_loss=0.2651, pruned_loss=0.06772, over 21374.00 frames. ], tot_loss[loss=0.221, simple_loss=0.295, pruned_loss=0.07352, over 4276474.25 frames. ], batch size: 177, lr: 4.18e-03, grad_scale: 32.0 2023-06-25 05:13:30,657 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=1263402.0, ans=0.125 2023-06-25 05:13:44,264 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=1263402.0, ans=0.2 2023-06-25 05:15:10,384 INFO [train.py:996] (3/4) Epoch 7, batch 27650, loss[loss=0.1999, simple_loss=0.2885, pruned_loss=0.05564, over 15965.00 frames. ], tot_loss[loss=0.2172, simple_loss=0.2891, pruned_loss=0.0726, over 4269496.13 frames. 
], batch size: 61, lr: 4.18e-03, grad_scale: 32.0 2023-06-25 05:15:37,187 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.286e+02 3.109e+02 3.684e+02 5.059e+02 1.214e+03, threshold=7.368e+02, percent-clipped=6.0 2023-06-25 05:15:44,709 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=1263762.0, ans=0.07 2023-06-25 05:16:24,853 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=1263882.0, ans=0.0 2023-06-25 05:16:57,724 INFO [train.py:996] (3/4) Epoch 7, batch 27700, loss[loss=0.2157, simple_loss=0.2977, pruned_loss=0.06683, over 21733.00 frames. ], tot_loss[loss=0.2159, simple_loss=0.2896, pruned_loss=0.07111, over 4262055.95 frames. ], batch size: 247, lr: 4.18e-03, grad_scale: 16.0 2023-06-25 05:17:34,761 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1264062.0, ans=0.0 2023-06-25 05:17:44,112 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=5.31 vs. limit=10.0 2023-06-25 05:17:45,728 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.19 vs. limit=6.0 2023-06-25 05:18:02,151 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1264122.0, ans=0.1 2023-06-25 05:18:26,978 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.54 vs. limit=15.0 2023-06-25 05:18:49,451 INFO [train.py:996] (3/4) Epoch 7, batch 27750, loss[loss=0.1819, simple_loss=0.2717, pruned_loss=0.046, over 21678.00 frames. ], tot_loss[loss=0.2178, simple_loss=0.2935, pruned_loss=0.07104, over 4260705.62 frames. ], batch size: 263, lr: 4.18e-03, grad_scale: 16.0 2023-06-25 05:19:19,113 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.393e+02 2.962e+02 3.488e+02 4.454e+02 9.416e+02, threshold=6.976e+02, percent-clipped=4.0 2023-06-25 05:19:45,008 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=1264422.0, ans=0.07 2023-06-25 05:20:00,264 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=1264482.0, ans=0.2 2023-06-25 05:20:14,569 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=1264542.0, ans=0.07 2023-06-25 05:20:23,173 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=1264542.0, ans=0.0 2023-06-25 05:20:36,079 INFO [train.py:996] (3/4) Epoch 7, batch 27800, loss[loss=0.2087, simple_loss=0.29, pruned_loss=0.06366, over 21411.00 frames. ], tot_loss[loss=0.217, simple_loss=0.2923, pruned_loss=0.07091, over 4274336.95 frames. ], batch size: 548, lr: 4.18e-03, grad_scale: 16.0 2023-06-25 05:20:55,553 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.min_positive, batch_count=1264602.0, ans=0.025 2023-06-25 05:21:22,269 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.38 vs. 
limit=15.0 2023-06-25 05:21:40,658 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=1264722.0, ans=0.0 2023-06-25 05:21:49,340 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=1264782.0, ans=0.2 2023-06-25 05:22:01,951 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=1264842.0, ans=0.125 2023-06-25 05:22:24,416 INFO [train.py:996] (3/4) Epoch 7, batch 27850, loss[loss=0.2319, simple_loss=0.3147, pruned_loss=0.07457, over 21352.00 frames. ], tot_loss[loss=0.2181, simple_loss=0.2924, pruned_loss=0.07194, over 4289931.70 frames. ], batch size: 159, lr: 4.18e-03, grad_scale: 8.0 2023-06-25 05:22:35,611 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1264902.0, ans=0.1 2023-06-25 05:22:43,501 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.40 vs. limit=15.0 2023-06-25 05:22:56,293 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1264962.0, ans=0.125 2023-06-25 05:22:57,342 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.414e+02 3.134e+02 3.811e+02 5.096e+02 8.843e+02, threshold=7.621e+02, percent-clipped=7.0 2023-06-25 05:23:11,324 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.62 vs. limit=10.0 2023-06-25 05:23:56,907 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1265142.0, ans=0.125 2023-06-25 05:23:59,114 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 05:24:27,027 INFO [train.py:996] (3/4) Epoch 7, batch 27900, loss[loss=0.2795, simple_loss=0.3959, pruned_loss=0.08153, over 20801.00 frames. ], tot_loss[loss=0.2235, simple_loss=0.3009, pruned_loss=0.07307, over 4287795.64 frames. ], batch size: 607, lr: 4.18e-03, grad_scale: 8.0 2023-06-25 05:25:07,330 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1265322.0, ans=0.125 2023-06-25 05:25:23,506 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1265382.0, ans=0.0 2023-06-25 05:26:21,613 INFO [train.py:996] (3/4) Epoch 7, batch 27950, loss[loss=0.2196, simple_loss=0.3089, pruned_loss=0.06513, over 21973.00 frames. ], tot_loss[loss=0.2202, simple_loss=0.3001, pruned_loss=0.07013, over 4285952.14 frames. ], batch size: 317, lr: 4.18e-03, grad_scale: 8.0 2023-06-25 05:26:42,603 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.145e+02 3.117e+02 4.053e+02 5.979e+02 1.114e+03, threshold=8.107e+02, percent-clipped=11.0 2023-06-25 05:27:07,654 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=1265622.0, ans=0.2 2023-06-25 05:28:09,596 INFO [train.py:996] (3/4) Epoch 7, batch 28000, loss[loss=0.228, simple_loss=0.3014, pruned_loss=0.07728, over 21902.00 frames. ], tot_loss[loss=0.2171, simple_loss=0.2982, pruned_loss=0.06795, over 4288480.22 frames. 
], batch size: 351, lr: 4.18e-03, grad_scale: 16.0 2023-06-25 05:29:07,672 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=1265982.0, ans=0.2 2023-06-25 05:29:41,024 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1266042.0, ans=0.0 2023-06-25 05:29:55,319 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-25 05:30:00,192 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1266102.0, ans=0.125 2023-06-25 05:30:01,606 INFO [train.py:996] (3/4) Epoch 7, batch 28050, loss[loss=0.2587, simple_loss=0.3375, pruned_loss=0.08998, over 21549.00 frames. ], tot_loss[loss=0.2178, simple_loss=0.297, pruned_loss=0.06932, over 4294293.48 frames. ], batch size: 471, lr: 4.18e-03, grad_scale: 16.0 2023-06-25 05:30:02,156 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=1266102.0, ans=0.05 2023-06-25 05:30:22,381 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.141e+02 2.952e+02 3.818e+02 5.160e+02 1.220e+03, threshold=7.636e+02, percent-clipped=4.0 2023-06-25 05:30:37,255 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=1266222.0, ans=0.2 2023-06-25 05:31:20,000 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.23 vs. limit=12.0 2023-06-25 05:31:45,304 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1266342.0, ans=0.125 2023-06-25 05:31:51,608 INFO [train.py:996] (3/4) Epoch 7, batch 28100, loss[loss=0.1879, simple_loss=0.2505, pruned_loss=0.06267, over 21195.00 frames. ], tot_loss[loss=0.2157, simple_loss=0.2927, pruned_loss=0.06931, over 4290097.91 frames. ], batch size: 176, lr: 4.18e-03, grad_scale: 16.0 2023-06-25 05:31:52,833 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.89 vs. limit=15.0 2023-06-25 05:32:46,004 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=1266522.0, ans=0.07 2023-06-25 05:33:40,485 INFO [train.py:996] (3/4) Epoch 7, batch 28150, loss[loss=0.1929, simple_loss=0.2557, pruned_loss=0.065, over 21150.00 frames. ], tot_loss[loss=0.2126, simple_loss=0.2863, pruned_loss=0.06944, over 4283335.92 frames. 
], batch size: 176, lr: 4.18e-03, grad_scale: 16.0 2023-06-25 05:33:53,453 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1266702.0, ans=0.0 2023-06-25 05:33:53,492 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1266702.0, ans=0.0 2023-06-25 05:34:00,253 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=1266762.0, ans=0.0 2023-06-25 05:34:01,756 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.438e+02 3.369e+02 4.176e+02 5.786e+02 1.041e+03, threshold=8.353e+02, percent-clipped=8.0 2023-06-25 05:34:39,146 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.76 vs. limit=15.0 2023-06-25 05:34:54,159 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=1266882.0, ans=0.05 2023-06-25 05:35:09,484 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1266882.0, ans=0.1 2023-06-25 05:35:29,304 INFO [train.py:996] (3/4) Epoch 7, batch 28200, loss[loss=0.2205, simple_loss=0.2837, pruned_loss=0.07864, over 21775.00 frames. ], tot_loss[loss=0.2135, simple_loss=0.2846, pruned_loss=0.07115, over 4284995.20 frames. ], batch size: 107, lr: 4.18e-03, grad_scale: 16.0 2023-06-25 05:36:20,398 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.37 vs. limit=12.0 2023-06-25 05:36:56,647 INFO [scaling.py:962] (3/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=5.40 vs. limit=8.0 2023-06-25 05:37:17,968 INFO [train.py:996] (3/4) Epoch 7, batch 28250, loss[loss=0.2315, simple_loss=0.288, pruned_loss=0.08745, over 21526.00 frames. ], tot_loss[loss=0.2182, simple_loss=0.2887, pruned_loss=0.07387, over 4287449.34 frames. ], batch size: 441, lr: 4.17e-03, grad_scale: 16.0 2023-06-25 05:37:32,606 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1267302.0, ans=0.125 2023-06-25 05:37:43,657 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.555e+02 3.449e+02 4.309e+02 5.866e+02 1.082e+03, threshold=8.618e+02, percent-clipped=6.0 2023-06-25 05:37:49,480 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=1267362.0, ans=0.0 2023-06-25 05:38:54,829 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1267542.0, ans=0.0 2023-06-25 05:38:56,405 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=1267542.0, ans=0.035 2023-06-25 05:39:08,739 INFO [train.py:996] (3/4) Epoch 7, batch 28300, loss[loss=0.2091, simple_loss=0.3066, pruned_loss=0.05577, over 21497.00 frames. ], tot_loss[loss=0.2162, simple_loss=0.2881, pruned_loss=0.07216, over 4275951.06 frames. 
], batch size: 471, lr: 4.17e-03, grad_scale: 16.0 2023-06-25 05:39:48,724 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1267662.0, ans=0.0 2023-06-25 05:40:07,020 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=6.31 vs. limit=15.0 2023-06-25 05:40:24,363 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=15.43 vs. limit=22.5 2023-06-25 05:40:46,705 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=1267842.0, ans=0.2 2023-06-25 05:40:50,376 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1267842.0, ans=0.0 2023-06-25 05:41:03,570 INFO [train.py:996] (3/4) Epoch 7, batch 28350, loss[loss=0.2271, simple_loss=0.2984, pruned_loss=0.07787, over 21492.00 frames. ], tot_loss[loss=0.2097, simple_loss=0.2856, pruned_loss=0.06691, over 4279066.26 frames. ], batch size: 389, lr: 4.17e-03, grad_scale: 16.0 2023-06-25 05:41:05,688 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1267902.0, ans=0.0 2023-06-25 05:41:20,107 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=1267902.0, ans=0.0 2023-06-25 05:41:29,704 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.944e+02 2.753e+02 3.449e+02 4.988e+02 1.144e+03, threshold=6.899e+02, percent-clipped=4.0 2023-06-25 05:41:49,069 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1268022.0, ans=0.0 2023-06-25 05:42:10,564 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1268082.0, ans=0.1 2023-06-25 05:42:11,077 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.00 vs. limit=12.0 2023-06-25 05:42:18,184 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.05 vs. limit=22.5 2023-06-25 05:42:44,190 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.21 vs. limit=6.0 2023-06-25 05:42:51,316 INFO [train.py:996] (3/4) Epoch 7, batch 28400, loss[loss=0.2134, simple_loss=0.2804, pruned_loss=0.07318, over 21586.00 frames. ], tot_loss[loss=0.2084, simple_loss=0.2827, pruned_loss=0.06706, over 4280111.62 frames. ], batch size: 231, lr: 4.17e-03, grad_scale: 32.0 2023-06-25 05:43:16,581 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.01 vs. 
limit=15.0 2023-06-25 05:43:28,430 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1268262.0, ans=0.125 2023-06-25 05:43:28,464 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1268262.0, ans=0.0 2023-06-25 05:43:39,035 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 05:44:41,991 INFO [train.py:996] (3/4) Epoch 7, batch 28450, loss[loss=0.2311, simple_loss=0.3009, pruned_loss=0.08065, over 20748.00 frames. ], tot_loss[loss=0.2142, simple_loss=0.2876, pruned_loss=0.07043, over 4278317.21 frames. ], batch size: 611, lr: 4.17e-03, grad_scale: 16.0 2023-06-25 05:44:51,423 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1268502.0, ans=0.0 2023-06-25 05:44:52,876 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1268502.0, ans=0.125 2023-06-25 05:45:04,284 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.35 vs. limit=22.5 2023-06-25 05:45:15,052 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.500e+02 3.249e+02 3.944e+02 5.811e+02 1.668e+03, threshold=7.889e+02, percent-clipped=19.0 2023-06-25 05:45:32,243 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.00 vs. limit=15.0 2023-06-25 05:45:45,646 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=1268682.0, ans=0.2 2023-06-25 05:45:49,103 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1268682.0, ans=0.1 2023-06-25 05:46:22,137 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1268742.0, ans=0.125 2023-06-25 05:46:34,939 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1268802.0, ans=0.0 2023-06-25 05:46:36,207 INFO [train.py:996] (3/4) Epoch 7, batch 28500, loss[loss=0.2022, simple_loss=0.2665, pruned_loss=0.06898, over 21259.00 frames. ], tot_loss[loss=0.2194, simple_loss=0.2917, pruned_loss=0.07353, over 4288167.76 frames. ], batch size: 608, lr: 4.17e-03, grad_scale: 16.0 2023-06-25 05:46:54,262 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=1268802.0, ans=0.07 2023-06-25 05:47:29,735 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=1268922.0, ans=0.0 2023-06-25 05:48:16,240 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.75 vs. limit=10.0 2023-06-25 05:48:31,001 INFO [train.py:996] (3/4) Epoch 7, batch 28550, loss[loss=0.2008, simple_loss=0.2774, pruned_loss=0.06214, over 21873.00 frames. ], tot_loss[loss=0.2262, simple_loss=0.3006, pruned_loss=0.07591, over 4288829.76 frames. 
], batch size: 98, lr: 4.17e-03, grad_scale: 16.0 2023-06-25 05:48:53,553 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.539e+02 3.516e+02 4.419e+02 5.883e+02 1.246e+03, threshold=8.838e+02, percent-clipped=8.0 2023-06-25 05:49:01,620 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.59 vs. limit=15.0 2023-06-25 05:49:50,145 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1269282.0, ans=0.1 2023-06-25 05:49:50,253 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1269282.0, ans=0.125 2023-06-25 05:49:51,703 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1269282.0, ans=0.125 2023-06-25 05:50:18,705 INFO [train.py:996] (3/4) Epoch 7, batch 28600, loss[loss=0.2508, simple_loss=0.3256, pruned_loss=0.08806, over 21565.00 frames. ], tot_loss[loss=0.2306, simple_loss=0.3058, pruned_loss=0.07775, over 4282525.19 frames. ], batch size: 112, lr: 4.17e-03, grad_scale: 16.0 2023-06-25 05:50:20,289 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.71 vs. limit=22.5 2023-06-25 05:51:25,602 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1269582.0, ans=0.0 2023-06-25 05:51:34,818 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=1269582.0, ans=0.0 2023-06-25 05:51:38,503 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1269582.0, ans=0.125 2023-06-25 05:51:39,051 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.95 vs. limit=15.0 2023-06-25 05:51:39,910 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer_ff3.min_abs, batch_count=1269582.0, ans=0.2 2023-06-25 05:51:54,359 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1269642.0, ans=0.0 2023-06-25 05:52:07,661 INFO [train.py:996] (3/4) Epoch 7, batch 28650, loss[loss=0.2796, simple_loss=0.3201, pruned_loss=0.1195, over 21417.00 frames. ], tot_loss[loss=0.2267, simple_loss=0.2995, pruned_loss=0.07699, over 4283311.91 frames. 
], batch size: 509, lr: 4.17e-03, grad_scale: 16.0 2023-06-25 05:52:18,619 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1269702.0, ans=0.2 2023-06-25 05:52:30,246 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.450e+02 3.536e+02 4.575e+02 6.589e+02 8.896e+02, threshold=9.150e+02, percent-clipped=1.0 2023-06-25 05:52:30,854 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1269762.0, ans=0.1 2023-06-25 05:52:32,834 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=1269762.0, ans=0.0 2023-06-25 05:53:04,361 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1269822.0, ans=0.1 2023-06-25 05:53:26,546 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=1269882.0, ans=0.125 2023-06-25 05:53:55,686 INFO [train.py:996] (3/4) Epoch 7, batch 28700, loss[loss=0.2148, simple_loss=0.2895, pruned_loss=0.07001, over 21660.00 frames. ], tot_loss[loss=0.2268, simple_loss=0.2984, pruned_loss=0.07763, over 4279835.66 frames. ], batch size: 263, lr: 4.17e-03, grad_scale: 16.0 2023-06-25 05:54:23,541 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=1270062.0, ans=0.0 2023-06-25 05:54:49,358 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1270122.0, ans=0.125 2023-06-25 05:55:28,686 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1270242.0, ans=0.125 2023-06-25 05:55:40,870 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=1270242.0, ans=0.125 2023-06-25 05:55:43,748 INFO [train.py:996] (3/4) Epoch 7, batch 28750, loss[loss=0.2408, simple_loss=0.3059, pruned_loss=0.08784, over 21338.00 frames. ], tot_loss[loss=0.2272, simple_loss=0.299, pruned_loss=0.0777, over 4283382.79 frames. ], batch size: 159, lr: 4.17e-03, grad_scale: 16.0 2023-06-25 05:55:47,769 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1270302.0, ans=0.125 2023-06-25 05:56:04,099 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.87 vs. limit=15.0 2023-06-25 05:56:06,400 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.645e+02 3.238e+02 3.725e+02 5.020e+02 9.578e+02, threshold=7.449e+02, percent-clipped=2.0 2023-06-25 05:56:26,298 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1270362.0, ans=0.125 2023-06-25 05:56:51,546 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1270422.0, ans=0.0 2023-06-25 05:57:30,660 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=1270542.0, ans=0.2 2023-06-25 05:57:33,209 INFO [train.py:996] (3/4) Epoch 7, batch 28800, loss[loss=0.2704, simple_loss=0.3416, pruned_loss=0.09965, over 21339.00 frames. 
], tot_loss[loss=0.2293, simple_loss=0.3021, pruned_loss=0.07826, over 4282434.40 frames. ], batch size: 159, lr: 4.17e-03, grad_scale: 32.0 2023-06-25 05:57:43,967 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1270602.0, ans=0.0 2023-06-25 05:57:59,517 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=1270662.0, ans=0.04949747468305833 2023-06-25 05:58:51,186 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1270782.0, ans=0.2 2023-06-25 05:59:22,079 INFO [train.py:996] (3/4) Epoch 7, batch 28850, loss[loss=0.2045, simple_loss=0.2715, pruned_loss=0.06877, over 21485.00 frames. ], tot_loss[loss=0.2313, simple_loss=0.3035, pruned_loss=0.07956, over 4291645.26 frames. ], batch size: 211, lr: 4.17e-03, grad_scale: 16.0 2023-06-25 06:00:02,818 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.622e+02 3.393e+02 4.119e+02 6.059e+02 1.112e+03, threshold=8.239e+02, percent-clipped=12.0 2023-06-25 06:00:05,454 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1270962.0, ans=0.125 2023-06-25 06:00:19,662 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=1271022.0, ans=0.125 2023-06-25 06:00:36,616 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1271082.0, ans=0.125 2023-06-25 06:00:43,923 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1271082.0, ans=0.1 2023-06-25 06:00:48,892 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=1271082.0, ans=0.0 2023-06-25 06:00:49,367 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.52 vs. limit=15.0 2023-06-25 06:01:17,964 INFO [train.py:996] (3/4) Epoch 7, batch 28900, loss[loss=0.2501, simple_loss=0.3278, pruned_loss=0.08617, over 21831.00 frames. ], tot_loss[loss=0.2354, simple_loss=0.3078, pruned_loss=0.0815, over 4285905.47 frames. ], batch size: 118, lr: 4.17e-03, grad_scale: 16.0 2023-06-25 06:01:20,069 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=1271202.0, ans=0.05 2023-06-25 06:03:09,346 INFO [train.py:996] (3/4) Epoch 7, batch 28950, loss[loss=0.227, simple_loss=0.3258, pruned_loss=0.0641, over 21760.00 frames. ], tot_loss[loss=0.2344, simple_loss=0.3079, pruned_loss=0.08045, over 4278354.07 frames. ], batch size: 351, lr: 4.17e-03, grad_scale: 8.0 2023-06-25 06:03:41,592 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=1271562.0, ans=0.125 2023-06-25 06:03:46,131 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.351e+02 3.609e+02 4.387e+02 5.987e+02 1.071e+03, threshold=8.774e+02, percent-clipped=6.0 2023-06-25 06:03:57,867 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.13 vs. 
limit=6.0 2023-06-25 06:04:39,515 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=1271742.0, ans=0.125 2023-06-25 06:04:54,953 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1271742.0, ans=0.1 2023-06-25 06:04:57,412 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.27 vs. limit=6.0 2023-06-25 06:04:58,414 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=1271742.0, ans=0.07 2023-06-25 06:05:02,827 INFO [train.py:996] (3/4) Epoch 7, batch 29000, loss[loss=0.2502, simple_loss=0.3493, pruned_loss=0.07552, over 21622.00 frames. ], tot_loss[loss=0.2342, simple_loss=0.3101, pruned_loss=0.07917, over 4274837.46 frames. ], batch size: 441, lr: 4.17e-03, grad_scale: 8.0 2023-06-25 06:05:05,098 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1271802.0, ans=0.125 2023-06-25 06:05:30,797 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1271862.0, ans=0.125 2023-06-25 06:06:21,730 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.60 vs. limit=6.0 2023-06-25 06:06:47,416 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.80 vs. limit=6.0 2023-06-25 06:06:51,570 INFO [train.py:996] (3/4) Epoch 7, batch 29050, loss[loss=0.2377, simple_loss=0.3075, pruned_loss=0.08395, over 21950.00 frames. ], tot_loss[loss=0.2333, simple_loss=0.3085, pruned_loss=0.07906, over 4272092.76 frames. ], batch size: 333, lr: 4.17e-03, grad_scale: 8.0 2023-06-25 06:07:21,610 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.493e+02 3.635e+02 4.186e+02 5.307e+02 1.029e+03, threshold=8.372e+02, percent-clipped=1.0 2023-06-25 06:07:56,979 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1272282.0, ans=0.125 2023-06-25 06:08:05,574 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.95 vs. limit=15.0 2023-06-25 06:08:06,889 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=1272282.0, ans=0.0 2023-06-25 06:08:37,152 INFO [train.py:996] (3/4) Epoch 7, batch 29100, loss[loss=0.2061, simple_loss=0.2613, pruned_loss=0.07548, over 21320.00 frames. ], tot_loss[loss=0.2267, simple_loss=0.3, pruned_loss=0.07674, over 4266054.65 frames. ], batch size: 144, lr: 4.17e-03, grad_scale: 8.0 2023-06-25 06:09:24,294 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=6.82 vs. 
limit=22.5 2023-06-25 06:09:46,377 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1272582.0, ans=0.0 2023-06-25 06:09:53,338 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1272582.0, ans=0.1 2023-06-25 06:09:54,825 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=1272642.0, ans=0.1 2023-06-25 06:10:21,034 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=1272642.0, ans=10.0 2023-06-25 06:10:23,682 INFO [train.py:996] (3/4) Epoch 7, batch 29150, loss[loss=0.205, simple_loss=0.2783, pruned_loss=0.06588, over 21302.00 frames. ], tot_loss[loss=0.225, simple_loss=0.2992, pruned_loss=0.0754, over 4270082.42 frames. ], batch size: 144, lr: 4.17e-03, grad_scale: 8.0 2023-06-25 06:10:36,022 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-25 06:10:54,196 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.266e+02 3.210e+02 4.222e+02 5.476e+02 9.873e+02, threshold=8.444e+02, percent-clipped=1.0 2023-06-25 06:11:08,759 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=1272822.0, ans=0.0 2023-06-25 06:11:12,475 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.38 vs. limit=15.0 2023-06-25 06:11:31,628 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=2.95 vs. limit=15.0 2023-06-25 06:11:34,587 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=1272882.0, ans=0.2 2023-06-25 06:11:38,881 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.76 vs. limit=15.0 2023-06-25 06:12:04,753 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1272942.0, ans=0.0 2023-06-25 06:12:10,633 INFO [train.py:996] (3/4) Epoch 7, batch 29200, loss[loss=0.1848, simple_loss=0.2557, pruned_loss=0.05691, over 21423.00 frames. ], tot_loss[loss=0.2223, simple_loss=0.295, pruned_loss=0.07478, over 4262917.11 frames. ], batch size: 194, lr: 4.17e-03, grad_scale: 16.0 2023-06-25 06:12:52,596 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.55 vs. limit=15.0 2023-06-25 06:12:56,996 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1273122.0, ans=0.1 2023-06-25 06:13:12,251 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=12.29 vs. limit=22.5 2023-06-25 06:13:50,098 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=1273242.0, ans=0.125 2023-06-25 06:14:04,512 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=18.31 vs. 
limit=22.5 2023-06-25 06:14:05,290 INFO [train.py:996] (3/4) Epoch 7, batch 29250, loss[loss=0.1951, simple_loss=0.2602, pruned_loss=0.06496, over 16674.00 frames. ], tot_loss[loss=0.2206, simple_loss=0.2948, pruned_loss=0.07317, over 4261006.10 frames. ], batch size: 61, lr: 4.16e-03, grad_scale: 16.0 2023-06-25 06:14:09,572 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 06:14:17,014 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.58 vs. limit=22.5 2023-06-25 06:14:19,933 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1273302.0, ans=0.125 2023-06-25 06:14:31,563 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.408e+02 3.162e+02 4.067e+02 5.479e+02 1.081e+03, threshold=8.134e+02, percent-clipped=3.0 2023-06-25 06:14:32,288 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1273362.0, ans=0.1 2023-06-25 06:14:48,297 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=1273422.0, ans=0.04949747468305833 2023-06-25 06:14:58,876 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1273422.0, ans=0.125 2023-06-25 06:15:17,944 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=1273482.0, ans=0.2 2023-06-25 06:15:20,149 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.13 vs. limit=15.0 2023-06-25 06:15:33,593 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=12.56 vs. limit=15.0 2023-06-25 06:15:36,649 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1273542.0, ans=0.125 2023-06-25 06:15:53,720 INFO [train.py:996] (3/4) Epoch 7, batch 29300, loss[loss=0.1956, simple_loss=0.2561, pruned_loss=0.06755, over 21777.00 frames. ], tot_loss[loss=0.2194, simple_loss=0.2951, pruned_loss=0.0718, over 4263637.80 frames. ], batch size: 124, lr: 4.16e-03, grad_scale: 16.0 2023-06-25 06:16:06,852 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.77 vs. limit=22.5 2023-06-25 06:17:41,123 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=1273902.0, ans=0.0 2023-06-25 06:17:42,131 INFO [train.py:996] (3/4) Epoch 7, batch 29350, loss[loss=0.1912, simple_loss=0.252, pruned_loss=0.06518, over 20007.00 frames. ], tot_loss[loss=0.2162, simple_loss=0.2906, pruned_loss=0.07087, over 4264544.82 frames. 
], batch size: 702, lr: 4.16e-03, grad_scale: 16.0 2023-06-25 06:17:51,605 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=1273902.0, ans=0.2 2023-06-25 06:18:13,783 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.302e+02 3.026e+02 3.822e+02 5.352e+02 1.093e+03, threshold=7.644e+02, percent-clipped=3.0 2023-06-25 06:18:28,592 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=1274022.0, ans=0.0 2023-06-25 06:18:30,279 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1274022.0, ans=0.1 2023-06-25 06:18:30,292 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1274022.0, ans=0.0 2023-06-25 06:19:15,370 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 06:19:18,904 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=1274142.0, ans=0.0 2023-06-25 06:19:30,141 INFO [train.py:996] (3/4) Epoch 7, batch 29400, loss[loss=0.1315, simple_loss=0.1879, pruned_loss=0.03757, over 21811.00 frames. ], tot_loss[loss=0.215, simple_loss=0.2916, pruned_loss=0.06921, over 4274758.85 frames. ], batch size: 118, lr: 4.16e-03, grad_scale: 16.0 2023-06-25 06:20:57,920 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1274382.0, ans=0.125 2023-06-25 06:21:13,564 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1274442.0, ans=0.1 2023-06-25 06:21:20,149 INFO [train.py:996] (3/4) Epoch 7, batch 29450, loss[loss=0.2102, simple_loss=0.2803, pruned_loss=0.07004, over 21340.00 frames. ], tot_loss[loss=0.2156, simple_loss=0.2927, pruned_loss=0.06931, over 4274824.53 frames. ], batch size: 176, lr: 4.16e-03, grad_scale: 8.0 2023-06-25 06:21:27,487 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=1274502.0, ans=0.125 2023-06-25 06:21:53,721 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.339e+02 3.532e+02 4.385e+02 5.559e+02 1.410e+03, threshold=8.770e+02, percent-clipped=9.0 2023-06-25 06:22:42,835 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1274682.0, ans=0.125 2023-06-25 06:23:08,503 INFO [train.py:996] (3/4) Epoch 7, batch 29500, loss[loss=0.233, simple_loss=0.3079, pruned_loss=0.07903, over 20662.00 frames. ], tot_loss[loss=0.2206, simple_loss=0.2963, pruned_loss=0.07242, over 4283220.82 frames. ], batch size: 607, lr: 4.16e-03, grad_scale: 8.0 2023-06-25 06:23:28,199 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten.whitening_limit, batch_count=1274802.0, ans=15.0 2023-06-25 06:23:29,766 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.37 vs. limit=22.5 2023-06-25 06:24:56,247 INFO [train.py:996] (3/4) Epoch 7, batch 29550, loss[loss=0.2075, simple_loss=0.2736, pruned_loss=0.07074, over 21629.00 frames. ], tot_loss[loss=0.2215, simple_loss=0.2952, pruned_loss=0.07389, over 4289023.04 frames. 
], batch size: 263, lr: 4.16e-03, grad_scale: 8.0 2023-06-25 06:25:14,993 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1275102.0, ans=0.1 2023-06-25 06:25:15,611 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=9.06 vs. limit=22.5 2023-06-25 06:25:29,141 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=1275162.0, ans=0.125 2023-06-25 06:25:30,057 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.666e+02 3.932e+02 4.748e+02 5.685e+02 9.373e+02, threshold=9.495e+02, percent-clipped=3.0 2023-06-25 06:25:41,812 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1275222.0, ans=0.0 2023-06-25 06:26:45,597 INFO [train.py:996] (3/4) Epoch 7, batch 29600, loss[loss=0.245, simple_loss=0.3364, pruned_loss=0.07678, over 21835.00 frames. ], tot_loss[loss=0.2259, simple_loss=0.3001, pruned_loss=0.07588, over 4291084.29 frames. ], batch size: 282, lr: 4.16e-03, grad_scale: 16.0 2023-06-25 06:27:05,922 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=9.29 vs. limit=15.0 2023-06-25 06:27:29,870 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=1275462.0, ans=0.2 2023-06-25 06:27:35,093 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1275522.0, ans=0.125 2023-06-25 06:27:45,631 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1275522.0, ans=0.125 2023-06-25 06:28:21,812 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-25 06:28:27,641 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.09 vs. limit=15.0 2023-06-25 06:28:31,736 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1275702.0, ans=0.125 2023-06-25 06:28:33,313 INFO [train.py:996] (3/4) Epoch 7, batch 29650, loss[loss=0.196, simple_loss=0.2974, pruned_loss=0.0473, over 20799.00 frames. ], tot_loss[loss=0.2209, simple_loss=0.2972, pruned_loss=0.07233, over 4282307.01 frames. ], batch size: 608, lr: 4.16e-03, grad_scale: 16.0 2023-06-25 06:28:52,544 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=1275702.0, ans=0.125 2023-06-25 06:29:08,640 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1275762.0, ans=0.125 2023-06-25 06:29:15,678 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=1275762.0, ans=0.07 2023-06-25 06:29:16,855 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.311e+02 3.458e+02 4.326e+02 5.325e+02 1.074e+03, threshold=8.651e+02, percent-clipped=3.0 2023-06-25 06:29:44,252 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=3.53 vs. 
limit=12.0 2023-06-25 06:30:27,049 INFO [train.py:996] (3/4) Epoch 7, batch 29700, loss[loss=0.2343, simple_loss=0.3077, pruned_loss=0.0804, over 21737.00 frames. ], tot_loss[loss=0.2231, simple_loss=0.2998, pruned_loss=0.07316, over 4290575.85 frames. ], batch size: 441, lr: 4.16e-03, grad_scale: 16.0 2023-06-25 06:30:29,140 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=1276002.0, ans=0.125 2023-06-25 06:30:32,602 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1276002.0, ans=0.125 2023-06-25 06:30:52,372 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1276062.0, ans=0.1 2023-06-25 06:32:01,220 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1276242.0, ans=0.1 2023-06-25 06:32:16,253 INFO [train.py:996] (3/4) Epoch 7, batch 29750, loss[loss=0.2048, simple_loss=0.2903, pruned_loss=0.05963, over 21499.00 frames. ], tot_loss[loss=0.2261, simple_loss=0.3055, pruned_loss=0.0733, over 4290996.25 frames. ], batch size: 211, lr: 4.16e-03, grad_scale: 16.0 2023-06-25 06:32:23,862 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1276302.0, ans=0.1 2023-06-25 06:32:54,077 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.514e+02 3.299e+02 3.896e+02 4.722e+02 1.232e+03, threshold=7.792e+02, percent-clipped=5.0 2023-06-25 06:33:03,348 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=1276362.0, ans=0.07 2023-06-25 06:33:27,655 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=1276482.0, ans=0.0 2023-06-25 06:33:27,695 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-25 06:34:03,386 INFO [train.py:996] (3/4) Epoch 7, batch 29800, loss[loss=0.215, simple_loss=0.2858, pruned_loss=0.07208, over 21861.00 frames. ], tot_loss[loss=0.2275, simple_loss=0.307, pruned_loss=0.07396, over 4295383.64 frames. ], batch size: 298, lr: 4.16e-03, grad_scale: 16.0 2023-06-25 06:35:49,088 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=1276902.0, ans=0.125 2023-06-25 06:35:50,570 INFO [train.py:996] (3/4) Epoch 7, batch 29850, loss[loss=0.2297, simple_loss=0.3073, pruned_loss=0.07607, over 21487.00 frames. ], tot_loss[loss=0.2232, simple_loss=0.3025, pruned_loss=0.07195, over 4290201.70 frames. ], batch size: 131, lr: 4.16e-03, grad_scale: 16.0 2023-06-25 06:35:55,910 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1276902.0, ans=0.125 2023-06-25 06:36:28,322 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.285e+02 2.948e+02 3.373e+02 4.045e+02 7.832e+02, threshold=6.745e+02, percent-clipped=1.0 2023-06-25 06:37:06,973 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.65 vs. 
limit=15.0 2023-06-25 06:37:23,178 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1277142.0, ans=0.125 2023-06-25 06:37:36,840 INFO [train.py:996] (3/4) Epoch 7, batch 29900, loss[loss=0.2375, simple_loss=0.356, pruned_loss=0.05946, over 19801.00 frames. ], tot_loss[loss=0.2241, simple_loss=0.3015, pruned_loss=0.07333, over 4292971.37 frames. ], batch size: 703, lr: 4.16e-03, grad_scale: 16.0 2023-06-25 06:38:26,093 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1277322.0, ans=0.0 2023-06-25 06:38:26,112 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=1277322.0, ans=0.0 2023-06-25 06:38:28,076 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1277322.0, ans=0.125 2023-06-25 06:38:47,115 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=1277382.0, ans=0.125 2023-06-25 06:38:54,470 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.70 vs. limit=15.0 2023-06-25 06:39:25,214 INFO [train.py:996] (3/4) Epoch 7, batch 29950, loss[loss=0.2618, simple_loss=0.3458, pruned_loss=0.08893, over 21820.00 frames. ], tot_loss[loss=0.2298, simple_loss=0.3057, pruned_loss=0.07689, over 4295453.05 frames. ], batch size: 124, lr: 4.16e-03, grad_scale: 16.0 2023-06-25 06:40:04,055 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=1277562.0, ans=0.0 2023-06-25 06:40:08,695 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.749e+02 3.319e+02 4.450e+02 5.387e+02 9.920e+02, threshold=8.899e+02, percent-clipped=12.0 2023-06-25 06:40:10,985 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=1277562.0, ans=0.125 2023-06-25 06:40:20,059 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=1277622.0, ans=0.0 2023-06-25 06:40:38,307 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.74 vs. limit=10.0 2023-06-25 06:40:46,581 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=1277682.0, ans=0.2 2023-06-25 06:41:18,921 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=4.59 vs. limit=15.0 2023-06-25 06:41:19,283 INFO [train.py:996] (3/4) Epoch 7, batch 30000, loss[loss=0.1938, simple_loss=0.2882, pruned_loss=0.04968, over 21717.00 frames. ], tot_loss[loss=0.2298, simple_loss=0.3068, pruned_loss=0.0764, over 4293154.12 frames. ], batch size: 247, lr: 4.16e-03, grad_scale: 32.0 2023-06-25 06:41:19,283 INFO [train.py:1019] (3/4) Computing validation loss 2023-06-25 06:41:39,216 INFO [train.py:1028] (3/4) Epoch 7, validation: loss=0.2493, simple_loss=0.346, pruned_loss=0.07628, over 1796401.00 frames. 
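[Editor's note] The two entries just above ("Computing validation loss" followed by "Epoch 7, validation: loss=... over 1796401.00 frames" and "Maximum memory allocated so far is 23654MB") come from the periodic validation pass. The following is only a minimal illustrative sketch of how such a frame-weighted dev-set loss and the peak-memory report could be produced; `model`, `valid_dl`, and `compute_loss` are placeholder names, not the actual objects in icefall's train.py.

```python
# Illustrative sketch only; placeholder names, not the icefall train.py code.
import torch

def run_validation(model, valid_dl, device, compute_loss):
    """Average the loss over the whole dev set, weighting by frame count."""
    model.eval()
    tot_loss, tot_frames = 0.0, 0.0
    with torch.no_grad():
        for batch in valid_dl:
            # compute_loss is assumed to return (loss tensor, number of frames)
            loss, num_frames = compute_loss(model, batch, device)
            tot_loss += loss.item() * num_frames
            tot_frames += num_frames
    model.train()
    return tot_loss / max(tot_frames, 1)

def log_peak_memory(device):
    """Report peak CUDA memory in MB, as in the 'Maximum memory allocated' lines."""
    peak_mb = torch.cuda.max_memory_allocated(device) // (1024 * 1024)
    print(f"Maximum memory allocated so far is {peak_mb}MB")
```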
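[Editor's note] The recurring "Clipping_scale=2.0, grad-norm quartiles ... threshold=... percent-clipped=..." entries report five quantiles of recent per-step gradient norms; in the numbers logged here the threshold always equals clipping_scale times the middle (median) value, e.g. 8.899e+02 = 2.0 x 4.450e+02 a few entries above. The sketch below reproduces that reporting pattern under stated assumptions (rolling window size and the exact definition of percent-clipped are guesses); it is not the actual optim.py implementation.

```python
# Illustrative sketch only; not the icefall optim.py code. Window size and the
# percent-clipped definition are assumptions made for the example.
from collections import deque
import torch

class GradNormMonitor:
    def __init__(self, clipping_scale=2.0, window=128):
        self.clipping_scale = clipping_scale
        self.norms = deque(maxlen=window)     # recent per-step gradient norms
        self.clipped = deque(maxlen=window)   # whether each recent step exceeded the threshold

    def update(self, grad_norm: float) -> float:
        """Record one step's gradient norm and return the current clipping threshold."""
        self.norms.append(grad_norm)
        t = torch.tensor(list(self.norms))
        # min, 25%, median, 75%, max -- the five numbers printed as "grad-norm quartiles"
        quartiles = torch.quantile(t, torch.tensor([0.0, 0.25, 0.5, 0.75, 1.0]))
        threshold = self.clipping_scale * quartiles[2].item()  # 2.0 * median, matching the log
        self.clipped.append(grad_norm > threshold)
        pct = 100.0 * sum(self.clipped) / len(self.clipped)
        print("Clipping_scale=%.1f, grad-norm quartiles %s, threshold=%.3e, percent-clipped=%.1f"
              % (self.clipping_scale,
                 " ".join(f"{q:.3e}" for q in quartiles.tolist()),
                 threshold, pct))
        return threshold
```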
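[Editor's note] The many "ScheduledFloat: name=..., batch_count=..., ans=..." entries report module hyperparameters (skip rates, balancer probabilities, bypass scales) whose value depends on the global batch count. The snippet below is a generic piecewise-linear schedule keyed on batch_count, given only to illustrate the idea; the breakpoints in the example are invented and the real ScheduledFloat in icefall's scaling.py may differ.

```python
# Illustrative sketch only: a piecewise-linear value scheduled on batch_count.
# The breakpoints used in the example are made up for illustration.
def scheduled_float(batch_count: float, points, default: float) -> float:
    """Linearly interpolate between sorted (batch_count, value) breakpoints."""
    points = sorted(points)
    if not points:
        return default
    if batch_count <= points[0][0]:
        return points[0][1]
    if batch_count >= points[-1][0]:
        return points[-1][1]
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        if x0 <= batch_count <= x1:
            frac = (batch_count - x0) / (x1 - x0)
            return y0 + frac * (y1 - y0)
    return default

# e.g. a skip rate that decays from 0.5 to 0.0 over the first 20k batches:
print(scheduled_float(1277862.0, [(0.0, 0.5), (20000.0, 0.0)], default=0.0))  # -> 0.0
```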
2023-06-25 06:41:39,217 INFO [train.py:1029] (3/4) Maximum memory allocated so far is 23654MB 2023-06-25 06:42:12,429 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1277862.0, ans=0.125 2023-06-25 06:42:14,457 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1277862.0, ans=0.0 2023-06-25 06:42:45,149 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=1277982.0, ans=0.2 2023-06-25 06:43:30,339 INFO [train.py:996] (3/4) Epoch 7, batch 30050, loss[loss=0.2696, simple_loss=0.3913, pruned_loss=0.07401, over 21155.00 frames. ], tot_loss[loss=0.2291, simple_loss=0.3098, pruned_loss=0.07427, over 4281073.06 frames. ], batch size: 548, lr: 4.16e-03, grad_scale: 16.0 2023-06-25 06:44:05,675 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.420e+02 3.279e+02 4.155e+02 5.724e+02 1.149e+03, threshold=8.309e+02, percent-clipped=6.0 2023-06-25 06:44:37,276 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=1278222.0, ans=0.2 2023-06-25 06:44:44,583 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.30 vs. limit=15.0 2023-06-25 06:45:02,717 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=1278342.0, ans=0.0 2023-06-25 06:45:06,233 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1278342.0, ans=0.125 2023-06-25 06:45:08,525 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=8.19 vs. limit=15.0 2023-06-25 06:45:17,720 INFO [train.py:996] (3/4) Epoch 7, batch 30100, loss[loss=0.2055, simple_loss=0.2675, pruned_loss=0.07175, over 21823.00 frames. ], tot_loss[loss=0.2264, simple_loss=0.307, pruned_loss=0.07285, over 4270952.01 frames. ], batch size: 118, lr: 4.16e-03, grad_scale: 16.0 2023-06-25 06:45:23,370 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=1278402.0, ans=0.2 2023-06-25 06:45:51,428 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1278462.0, ans=0.1 2023-06-25 06:46:20,743 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=1278522.0, ans=0.0 2023-06-25 06:46:52,055 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=1278642.0, ans=0.2 2023-06-25 06:47:10,872 INFO [train.py:996] (3/4) Epoch 7, batch 30150, loss[loss=0.246, simple_loss=0.3132, pruned_loss=0.08938, over 21549.00 frames. ], tot_loss[loss=0.2267, simple_loss=0.3036, pruned_loss=0.07486, over 4264228.10 frames. ], batch size: 389, lr: 4.16e-03, grad_scale: 16.0 2023-06-25 06:47:16,849 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=1278702.0, ans=0.0 2023-06-25 06:47:25,159 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.12 vs. 
limit=15.0 2023-06-25 06:47:33,930 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1278762.0, ans=0.0 2023-06-25 06:47:44,567 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1278762.0, ans=0.1 2023-06-25 06:47:47,387 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.588e+02 3.267e+02 3.809e+02 4.984e+02 9.103e+02, threshold=7.618e+02, percent-clipped=3.0 2023-06-25 06:47:56,894 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=1278822.0, ans=0.0 2023-06-25 06:48:09,499 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1278822.0, ans=0.125 2023-06-25 06:48:20,493 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=1278882.0, ans=0.04949747468305833 2023-06-25 06:48:46,879 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1278942.0, ans=0.1 2023-06-25 06:48:56,768 INFO [train.py:996] (3/4) Epoch 7, batch 30200, loss[loss=0.2166, simple_loss=0.3006, pruned_loss=0.06626, over 21719.00 frames. ], tot_loss[loss=0.2264, simple_loss=0.3064, pruned_loss=0.07324, over 4271803.99 frames. ], batch size: 298, lr: 4.16e-03, grad_scale: 16.0 2023-06-25 06:49:01,239 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=1279002.0, ans=0.125 2023-06-25 06:49:37,968 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-25 06:49:54,355 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1279122.0, ans=0.125 2023-06-25 06:49:58,611 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=6.77 vs. limit=15.0 2023-06-25 06:50:16,270 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1279182.0, ans=0.125 2023-06-25 06:50:59,074 INFO [train.py:996] (3/4) Epoch 7, batch 30250, loss[loss=0.2987, simple_loss=0.4056, pruned_loss=0.09591, over 21330.00 frames. ], tot_loss[loss=0.2331, simple_loss=0.3142, pruned_loss=0.07601, over 4273276.48 frames. 
], batch size: 549, lr: 4.16e-03, grad_scale: 16.0 2023-06-25 06:51:28,740 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1279362.0, ans=0.0 2023-06-25 06:51:33,132 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.437e+02 3.334e+02 4.601e+02 6.960e+02 1.343e+03, threshold=9.203e+02, percent-clipped=16.0 2023-06-25 06:51:39,389 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1279422.0, ans=0.125 2023-06-25 06:51:42,733 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1279422.0, ans=0.1 2023-06-25 06:52:34,882 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1279542.0, ans=0.0 2023-06-25 06:52:41,298 INFO [train.py:996] (3/4) Epoch 7, batch 30300, loss[loss=0.2127, simple_loss=0.2746, pruned_loss=0.07536, over 21516.00 frames. ], tot_loss[loss=0.2313, simple_loss=0.3108, pruned_loss=0.07586, over 4276759.60 frames. ], batch size: 414, lr: 4.15e-03, grad_scale: 16.0 2023-06-25 06:53:35,333 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=1279722.0, ans=0.0 2023-06-25 06:53:57,021 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1279782.0, ans=0.125 2023-06-25 06:54:25,725 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1279842.0, ans=0.125 2023-06-25 06:54:37,765 INFO [train.py:996] (3/4) Epoch 7, batch 30350, loss[loss=0.2385, simple_loss=0.3225, pruned_loss=0.07728, over 21796.00 frames. ], tot_loss[loss=0.2325, simple_loss=0.3113, pruned_loss=0.07687, over 4281293.22 frames. ], batch size: 333, lr: 4.15e-03, grad_scale: 16.0 2023-06-25 06:55:05,063 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.424e+02 3.772e+02 4.635e+02 6.721e+02 1.384e+03, threshold=9.269e+02, percent-clipped=9.0 2023-06-25 06:55:15,779 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=1280022.0, ans=0.2 2023-06-25 06:55:35,747 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=1280082.0, ans=0.125 2023-06-25 06:55:38,132 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten.whitening_limit, batch_count=1280082.0, ans=15.0 2023-06-25 06:55:58,682 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=5.14 vs. limit=12.0 2023-06-25 06:56:00,341 INFO [train.py:996] (3/4) Epoch 7, batch 30400, loss[loss=0.2003, simple_loss=0.2504, pruned_loss=0.0751, over 20348.00 frames. ], tot_loss[loss=0.2287, simple_loss=0.3054, pruned_loss=0.07597, over 4267055.27 frames. ], batch size: 703, lr: 4.15e-03, grad_scale: 32.0 2023-06-25 06:56:01,534 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.22 vs. 
limit=15.0 2023-06-25 06:56:12,598 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=1280202.0, ans=0.09899494936611666 2023-06-25 06:57:25,057 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=1280442.0, ans=0.125 2023-06-25 06:57:33,107 INFO [train.py:996] (3/4) Epoch 7, batch 30450, loss[loss=0.2697, simple_loss=0.3907, pruned_loss=0.07432, over 19755.00 frames. ], tot_loss[loss=0.2285, simple_loss=0.3059, pruned_loss=0.07559, over 4206006.44 frames. ], batch size: 702, lr: 4.15e-03, grad_scale: 32.0 2023-06-25 06:57:58,397 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 06:57:59,726 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1280562.0, ans=0.125 2023-06-25 06:58:02,608 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.728e+02 6.501e+02 9.013e+02 1.486e+03 3.895e+03, threshold=1.803e+03, percent-clipped=46.0 2023-06-25 06:58:07,765 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1280622.0, ans=0.0 2023-06-25 07:01:02,063 INFO [train.py:996] (3/4) Epoch 8, batch 0, loss[loss=0.2157, simple_loss=0.2808, pruned_loss=0.07532, over 21228.00 frames. ], tot_loss[loss=0.2157, simple_loss=0.2808, pruned_loss=0.07532, over 21228.00 frames. ], batch size: 160, lr: 3.86e-03, grad_scale: 32.0 2023-06-25 07:01:02,064 INFO [train.py:1019] (3/4) Computing validation loss 2023-06-25 07:01:19,569 INFO [train.py:1028] (3/4) Epoch 8, validation: loss=0.2406, simple_loss=0.3467, pruned_loss=0.06724, over 1796401.00 frames. 2023-06-25 07:01:19,570 INFO [train.py:1029] (3/4) Maximum memory allocated so far is 23654MB 2023-06-25 07:01:20,698 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.05 vs. limit=22.5 2023-06-25 07:03:05,848 INFO [train.py:996] (3/4) Epoch 8, batch 50, loss[loss=0.248, simple_loss=0.3396, pruned_loss=0.07822, over 21773.00 frames. ], tot_loss[loss=0.2315, simple_loss=0.3114, pruned_loss=0.07583, over 963420.81 frames. ], batch size: 282, lr: 3.86e-03, grad_scale: 32.0 2023-06-25 07:03:29,570 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1281132.0, ans=0.0 2023-06-25 07:03:49,762 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.736e+02 3.478e+02 5.204e+02 1.094e+03 2.896e+03, threshold=1.041e+03, percent-clipped=7.0 2023-06-25 07:04:51,354 INFO [train.py:996] (3/4) Epoch 8, batch 100, loss[loss=0.1829, simple_loss=0.2503, pruned_loss=0.0578, over 21768.00 frames. ], tot_loss[loss=0.2388, simple_loss=0.3218, pruned_loss=0.07793, over 1695893.79 frames. ], batch size: 102, lr: 3.86e-03, grad_scale: 16.0 2023-06-25 07:05:09,986 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1281432.0, ans=0.125 2023-06-25 07:06:01,502 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1281552.0, ans=0.1 2023-06-25 07:06:37,760 INFO [train.py:996] (3/4) Epoch 8, batch 150, loss[loss=0.2636, simple_loss=0.3528, pruned_loss=0.08718, over 21491.00 frames. 
], tot_loss[loss=0.2369, simple_loss=0.3221, pruned_loss=0.07584, over 2248439.11 frames. ], batch size: 471, lr: 3.86e-03, grad_scale: 16.0 2023-06-25 07:07:01,952 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=1281732.0, ans=0.09899494936611666 2023-06-25 07:07:27,442 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.475e+02 3.041e+02 3.436e+02 4.359e+02 9.068e+02, threshold=6.872e+02, percent-clipped=0.0 2023-06-25 07:08:00,804 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=1281912.0, ans=0.04949747468305833 2023-06-25 07:08:03,971 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=1281912.0, ans=0.025 2023-06-25 07:08:07,650 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1281912.0, ans=0.0 2023-06-25 07:08:09,999 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.12 vs. limit=10.0 2023-06-25 07:08:18,557 INFO [train.py:996] (3/4) Epoch 8, batch 200, loss[loss=0.1976, simple_loss=0.2645, pruned_loss=0.06538, over 21895.00 frames. ], tot_loss[loss=0.2346, simple_loss=0.3193, pruned_loss=0.075, over 2701922.78 frames. ], batch size: 107, lr: 3.86e-03, grad_scale: 16.0 2023-06-25 07:08:41,048 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1282032.0, ans=0.1 2023-06-25 07:09:35,089 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=1282152.0, ans=0.0 2023-06-25 07:09:41,815 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-25 07:10:00,009 INFO [train.py:996] (3/4) Epoch 8, batch 250, loss[loss=0.2333, simple_loss=0.2986, pruned_loss=0.08399, over 21800.00 frames. ], tot_loss[loss=0.2307, simple_loss=0.3134, pruned_loss=0.07402, over 3024627.68 frames. ], batch size: 298, lr: 3.86e-03, grad_scale: 16.0 2023-06-25 07:10:26,706 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1282332.0, ans=0.125 2023-06-25 07:10:45,279 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.611e+02 3.498e+02 4.445e+02 5.647e+02 1.101e+03, threshold=8.891e+02, percent-clipped=14.0 2023-06-25 07:11:24,691 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1282452.0, ans=0.1 2023-06-25 07:11:37,078 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1282512.0, ans=0.1 2023-06-25 07:11:49,072 INFO [train.py:996] (3/4) Epoch 8, batch 300, loss[loss=0.2102, simple_loss=0.3174, pruned_loss=0.05148, over 19747.00 frames. ], tot_loss[loss=0.2275, simple_loss=0.3082, pruned_loss=0.07338, over 3291201.94 frames. 
], batch size: 703, lr: 3.86e-03, grad_scale: 16.0 2023-06-25 07:12:07,566 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=1282632.0, ans=0.125 2023-06-25 07:12:21,703 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=1282632.0, ans=0.125 2023-06-25 07:13:39,779 INFO [train.py:996] (3/4) Epoch 8, batch 350, loss[loss=0.1901, simple_loss=0.2558, pruned_loss=0.06223, over 21171.00 frames. ], tot_loss[loss=0.2224, simple_loss=0.3011, pruned_loss=0.07181, over 3516092.43 frames. ], batch size: 608, lr: 3.86e-03, grad_scale: 16.0 2023-06-25 07:13:55,066 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=8.08 vs. limit=15.0 2023-06-25 07:14:30,123 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.317e+02 3.131e+02 3.897e+02 5.934e+02 1.239e+03, threshold=7.794e+02, percent-clipped=5.0 2023-06-25 07:15:27,509 INFO [train.py:996] (3/4) Epoch 8, batch 400, loss[loss=0.1925, simple_loss=0.2703, pruned_loss=0.05739, over 21657.00 frames. ], tot_loss[loss=0.2184, simple_loss=0.2947, pruned_loss=0.07107, over 3684101.39 frames. ], batch size: 247, lr: 3.86e-03, grad_scale: 32.0 2023-06-25 07:15:45,818 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=1283232.0, ans=0.05 2023-06-25 07:16:44,805 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=1283352.0, ans=0.0 2023-06-25 07:16:56,537 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=1283352.0, ans=0.0 2023-06-25 07:17:08,235 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=11.76 vs. limit=15.0 2023-06-25 07:17:16,020 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1283412.0, ans=0.1 2023-06-25 07:17:19,195 INFO [train.py:996] (3/4) Epoch 8, batch 450, loss[loss=0.2466, simple_loss=0.3438, pruned_loss=0.07471, over 21226.00 frames. ], tot_loss[loss=0.2163, simple_loss=0.2925, pruned_loss=0.07, over 3817080.37 frames. ], batch size: 159, lr: 3.86e-03, grad_scale: 32.0 2023-06-25 07:17:20,117 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=1283472.0, ans=0.125 2023-06-25 07:17:25,941 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.11 vs. limit=22.5 2023-06-25 07:17:27,403 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.00 vs. limit=22.5 2023-06-25 07:18:16,648 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.309e+02 3.535e+02 4.359e+02 5.649e+02 1.208e+03, threshold=8.718e+02, percent-clipped=9.0 2023-06-25 07:18:19,060 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 07:19:01,978 INFO [train.py:996] (3/4) Epoch 8, batch 500, loss[loss=0.2657, simple_loss=0.3826, pruned_loss=0.07445, over 21249.00 frames. 
], tot_loss[loss=0.2171, simple_loss=0.2955, pruned_loss=0.06934, over 3921266.67 frames. ], batch size: 548, lr: 3.86e-03, grad_scale: 16.0 2023-06-25 07:19:21,264 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1283772.0, ans=0.0 2023-06-25 07:19:45,311 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1283892.0, ans=0.125 2023-06-25 07:20:29,665 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1283952.0, ans=0.125 2023-06-25 07:20:49,148 INFO [train.py:996] (3/4) Epoch 8, batch 550, loss[loss=0.3587, simple_loss=0.4412, pruned_loss=0.1381, over 21453.00 frames. ], tot_loss[loss=0.2182, simple_loss=0.2989, pruned_loss=0.06874, over 4004398.41 frames. ], batch size: 507, lr: 3.85e-03, grad_scale: 16.0 2023-06-25 07:21:05,111 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1284072.0, ans=0.125 2023-06-25 07:21:11,798 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1284132.0, ans=0.125 2023-06-25 07:21:45,854 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.415e+02 3.578e+02 5.101e+02 7.574e+02 1.639e+03, threshold=1.020e+03, percent-clipped=17.0 2023-06-25 07:22:28,846 INFO [train.py:996] (3/4) Epoch 8, batch 600, loss[loss=0.2194, simple_loss=0.2851, pruned_loss=0.07691, over 22014.00 frames. ], tot_loss[loss=0.2202, simple_loss=0.3015, pruned_loss=0.06944, over 4059732.08 frames. ], batch size: 103, lr: 3.85e-03, grad_scale: 16.0 2023-06-25 07:24:03,744 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1284612.0, ans=0.0 2023-06-25 07:24:14,789 INFO [train.py:996] (3/4) Epoch 8, batch 650, loss[loss=0.2094, simple_loss=0.3124, pruned_loss=0.05322, over 21401.00 frames. ], tot_loss[loss=0.2208, simple_loss=0.3018, pruned_loss=0.06988, over 4103806.45 frames. ], batch size: 211, lr: 3.85e-03, grad_scale: 16.0 2023-06-25 07:24:23,258 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.14 vs. limit=6.0 2023-06-25 07:24:37,750 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1284732.0, ans=0.0 2023-06-25 07:25:16,311 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.375e+02 3.313e+02 4.571e+02 7.176e+02 1.629e+03, threshold=9.143e+02, percent-clipped=10.0 2023-06-25 07:25:20,272 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1284792.0, ans=0.125 2023-06-25 07:25:23,684 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=1284792.0, ans=0.2 2023-06-25 07:26:00,075 INFO [train.py:996] (3/4) Epoch 8, batch 700, loss[loss=0.26, simple_loss=0.3223, pruned_loss=0.09885, over 21321.00 frames. ], tot_loss[loss=0.2225, simple_loss=0.3036, pruned_loss=0.07068, over 4146615.31 frames. 
], batch size: 471, lr: 3.85e-03, grad_scale: 16.0 2023-06-25 07:26:02,549 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1284972.0, ans=0.125 2023-06-25 07:26:04,149 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1284972.0, ans=0.0 2023-06-25 07:26:22,390 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.66 vs. limit=12.0 2023-06-25 07:27:06,576 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1285092.0, ans=0.1 2023-06-25 07:27:20,383 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1285152.0, ans=0.125 2023-06-25 07:27:44,312 INFO [train.py:996] (3/4) Epoch 8, batch 750, loss[loss=0.2621, simple_loss=0.3909, pruned_loss=0.06664, over 19794.00 frames. ], tot_loss[loss=0.2237, simple_loss=0.3041, pruned_loss=0.07165, over 4172780.41 frames. ], batch size: 702, lr: 3.85e-03, grad_scale: 16.0 2023-06-25 07:28:08,976 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=1285332.0, ans=0.125 2023-06-25 07:28:47,152 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.536e+02 3.601e+02 4.438e+02 5.764e+02 1.140e+03, threshold=8.877e+02, percent-clipped=3.0 2023-06-25 07:29:12,067 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=1285452.0, ans=0.2 2023-06-25 07:29:32,230 INFO [train.py:996] (3/4) Epoch 8, batch 800, loss[loss=0.2337, simple_loss=0.2813, pruned_loss=0.09303, over 21496.00 frames. ], tot_loss[loss=0.2231, simple_loss=0.3018, pruned_loss=0.07221, over 4200985.47 frames. ], batch size: 508, lr: 3.85e-03, grad_scale: 32.0 2023-06-25 07:29:34,673 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1285572.0, ans=0.125 2023-06-25 07:30:34,128 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=1285692.0, ans=0.0 2023-06-25 07:31:25,130 INFO [train.py:996] (3/4) Epoch 8, batch 850, loss[loss=0.2022, simple_loss=0.2726, pruned_loss=0.06586, over 21822.00 frames. ], tot_loss[loss=0.2209, simple_loss=0.2984, pruned_loss=0.07176, over 4217938.29 frames. ], batch size: 298, lr: 3.85e-03, grad_scale: 16.0 2023-06-25 07:31:25,889 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1285872.0, ans=0.125 2023-06-25 07:32:23,991 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.482e+02 3.234e+02 3.833e+02 4.866e+02 9.722e+02, threshold=7.666e+02, percent-clipped=1.0 2023-06-25 07:32:43,017 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1286052.0, ans=0.125 2023-06-25 07:33:13,028 INFO [train.py:996] (3/4) Epoch 8, batch 900, loss[loss=0.1934, simple_loss=0.2681, pruned_loss=0.05933, over 21324.00 frames. ], tot_loss[loss=0.2181, simple_loss=0.2954, pruned_loss=0.0704, over 4232835.32 frames. 
], batch size: 159, lr: 3.85e-03, grad_scale: 16.0 2023-06-25 07:33:26,818 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.50 vs. limit=22.5 2023-06-25 07:33:59,844 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1286232.0, ans=0.125 2023-06-25 07:34:00,463 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.39 vs. limit=15.0 2023-06-25 07:34:07,447 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.95 vs. limit=22.5 2023-06-25 07:34:13,723 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=1286292.0, ans=0.2 2023-06-25 07:34:41,592 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.15 vs. limit=6.0 2023-06-25 07:35:01,296 INFO [train.py:996] (3/4) Epoch 8, batch 950, loss[loss=0.2553, simple_loss=0.3272, pruned_loss=0.09173, over 21807.00 frames. ], tot_loss[loss=0.2163, simple_loss=0.2931, pruned_loss=0.06978, over 4248153.07 frames. ], batch size: 414, lr: 3.85e-03, grad_scale: 16.0 2023-06-25 07:35:39,941 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1286532.0, ans=0.0 2023-06-25 07:35:54,400 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.907e+02 3.602e+02 4.628e+02 6.707e+02 1.446e+03, threshold=9.256e+02, percent-clipped=20.0 2023-06-25 07:36:03,607 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=1286652.0, ans=0.0 2023-06-25 07:36:37,218 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.33 vs. limit=10.0 2023-06-25 07:36:42,695 INFO [train.py:996] (3/4) Epoch 8, batch 1000, loss[loss=0.2216, simple_loss=0.287, pruned_loss=0.07807, over 21648.00 frames. ], tot_loss[loss=0.2179, simple_loss=0.294, pruned_loss=0.0709, over 4263647.06 frames. ], batch size: 263, lr: 3.85e-03, grad_scale: 16.0 2023-06-25 07:37:42,064 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1286892.0, ans=0.125 2023-06-25 07:38:23,534 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=1287012.0, ans=0.125 2023-06-25 07:38:31,299 INFO [train.py:996] (3/4) Epoch 8, batch 1050, loss[loss=0.1981, simple_loss=0.2732, pruned_loss=0.06151, over 21278.00 frames. ], tot_loss[loss=0.2162, simple_loss=0.2922, pruned_loss=0.07013, over 4272135.29 frames. ], batch size: 176, lr: 3.85e-03, grad_scale: 16.0 2023-06-25 07:39:30,739 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.534e+02 3.439e+02 4.407e+02 5.715e+02 1.308e+03, threshold=8.815e+02, percent-clipped=4.0 2023-06-25 07:40:19,269 INFO [train.py:996] (3/4) Epoch 8, batch 1100, loss[loss=0.1759, simple_loss=0.2586, pruned_loss=0.04658, over 21193.00 frames. ], tot_loss[loss=0.2171, simple_loss=0.2938, pruned_loss=0.07026, over 4281022.69 frames. 
], batch size: 176, lr: 3.85e-03, grad_scale: 16.0 2023-06-25 07:41:09,412 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1287492.0, ans=0.1 2023-06-25 07:41:47,068 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.94 vs. limit=6.0 2023-06-25 07:42:15,463 INFO [train.py:996] (3/4) Epoch 8, batch 1150, loss[loss=0.2021, simple_loss=0.2817, pruned_loss=0.06122, over 21488.00 frames. ], tot_loss[loss=0.2172, simple_loss=0.2931, pruned_loss=0.07061, over 4281523.56 frames. ], batch size: 548, lr: 3.85e-03, grad_scale: 16.0 2023-06-25 07:42:29,710 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=1287672.0, ans=0.0 2023-06-25 07:42:40,512 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=1287732.0, ans=0.2 2023-06-25 07:42:40,585 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=1287732.0, ans=0.0 2023-06-25 07:42:44,197 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1287732.0, ans=0.0 2023-06-25 07:42:59,617 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.638e+02 3.529e+02 4.325e+02 5.726e+02 1.140e+03, threshold=8.649e+02, percent-clipped=5.0 2023-06-25 07:43:18,169 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=1287852.0, ans=0.2 2023-06-25 07:43:50,289 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1287912.0, ans=0.1 2023-06-25 07:43:55,034 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=1287912.0, ans=0.2 2023-06-25 07:43:55,146 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=1287912.0, ans=0.125 2023-06-25 07:43:59,487 INFO [train.py:996] (3/4) Epoch 8, batch 1200, loss[loss=0.1981, simple_loss=0.2478, pruned_loss=0.07425, over 20370.00 frames. ], tot_loss[loss=0.2187, simple_loss=0.2954, pruned_loss=0.07101, over 4280605.92 frames. ], batch size: 703, lr: 3.85e-03, grad_scale: 32.0 2023-06-25 07:44:05,292 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=1287972.0, ans=0.125 2023-06-25 07:44:38,878 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.72 vs. limit=6.0 2023-06-25 07:44:39,990 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=1288092.0, ans=0.0 2023-06-25 07:44:43,726 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1288092.0, ans=0.0 2023-06-25 07:44:59,757 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1288152.0, ans=0.125 2023-06-25 07:45:02,251 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.75 vs. 
limit=10.0 2023-06-25 07:45:08,618 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1288152.0, ans=0.125 2023-06-25 07:45:47,955 INFO [train.py:996] (3/4) Epoch 8, batch 1250, loss[loss=0.1987, simple_loss=0.2782, pruned_loss=0.05957, over 21127.00 frames. ], tot_loss[loss=0.2208, simple_loss=0.2972, pruned_loss=0.07218, over 4281964.60 frames. ], batch size: 607, lr: 3.85e-03, grad_scale: 32.0 2023-06-25 07:45:55,529 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1288272.0, ans=0.125 2023-06-25 07:46:06,421 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1288332.0, ans=0.1 2023-06-25 07:46:15,377 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=1288332.0, ans=0.0 2023-06-25 07:46:19,029 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=1288332.0, ans=0.2 2023-06-25 07:46:38,024 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.488e+02 3.316e+02 4.127e+02 5.335e+02 1.234e+03, threshold=8.255e+02, percent-clipped=5.0 2023-06-25 07:47:36,786 INFO [train.py:996] (3/4) Epoch 8, batch 1300, loss[loss=0.279, simple_loss=0.3367, pruned_loss=0.1106, over 21707.00 frames. ], tot_loss[loss=0.2204, simple_loss=0.2974, pruned_loss=0.07167, over 4283920.15 frames. ], batch size: 507, lr: 3.85e-03, grad_scale: 16.0 2023-06-25 07:47:54,652 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1288632.0, ans=0.125 2023-06-25 07:47:55,368 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.33 vs. limit=15.0 2023-06-25 07:48:33,240 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=1288752.0, ans=0.2 2023-06-25 07:48:36,560 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1288752.0, ans=0.125 2023-06-25 07:48:40,738 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.80 vs. limit=15.0 2023-06-25 07:48:43,680 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-25 07:49:17,836 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1288812.0, ans=0.1 2023-06-25 07:49:25,887 INFO [train.py:996] (3/4) Epoch 8, batch 1350, loss[loss=0.2554, simple_loss=0.334, pruned_loss=0.08841, over 21615.00 frames. ], tot_loss[loss=0.2237, simple_loss=0.3003, pruned_loss=0.07354, over 4286691.19 frames. ], batch size: 471, lr: 3.85e-03, grad_scale: 16.0 2023-06-25 07:50:11,933 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.01 vs. 
limit=15.0 2023-06-25 07:50:14,271 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=1288992.0, ans=0.125 2023-06-25 07:50:14,367 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1288992.0, ans=0.125 2023-06-25 07:50:15,507 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.606e+02 3.456e+02 4.378e+02 5.897e+02 1.151e+03, threshold=8.757e+02, percent-clipped=2.0 2023-06-25 07:50:40,378 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=1289052.0, ans=0.125 2023-06-25 07:51:03,933 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1289112.0, ans=0.0 2023-06-25 07:51:08,356 INFO [train.py:996] (3/4) Epoch 8, batch 1400, loss[loss=0.2247, simple_loss=0.2926, pruned_loss=0.07843, over 21518.00 frames. ], tot_loss[loss=0.2225, simple_loss=0.2989, pruned_loss=0.07302, over 4288292.38 frames. ], batch size: 548, lr: 3.85e-03, grad_scale: 16.0 2023-06-25 07:51:19,675 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=10.85 vs. limit=15.0 2023-06-25 07:51:42,980 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1289292.0, ans=0.0 2023-06-25 07:51:57,408 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=1289292.0, ans=0.0 2023-06-25 07:52:01,044 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=1289292.0, ans=0.2 2023-06-25 07:52:54,522 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=1289412.0, ans=0.2 2023-06-25 07:52:57,311 INFO [train.py:996] (3/4) Epoch 8, batch 1450, loss[loss=0.1845, simple_loss=0.2526, pruned_loss=0.05815, over 21682.00 frames. ], tot_loss[loss=0.2221, simple_loss=0.2985, pruned_loss=0.07283, over 4290226.47 frames. ], batch size: 247, lr: 3.85e-03, grad_scale: 16.0 2023-06-25 07:53:05,228 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1289472.0, ans=0.1 2023-06-25 07:53:08,722 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=1289472.0, ans=0.125 2023-06-25 07:53:47,248 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=1289592.0, ans=0.015 2023-06-25 07:53:48,351 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.564e+02 3.442e+02 4.414e+02 6.258e+02 1.881e+03, threshold=8.827e+02, percent-clipped=13.0 2023-06-25 07:54:46,775 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.38 vs. limit=15.0 2023-06-25 07:54:47,197 INFO [train.py:996] (3/4) Epoch 8, batch 1500, loss[loss=0.2438, simple_loss=0.2982, pruned_loss=0.09475, over 21627.00 frames. ], tot_loss[loss=0.223, simple_loss=0.2995, pruned_loss=0.07328, over 4289451.37 frames. 
], batch size: 507, lr: 3.85e-03, grad_scale: 8.0 2023-06-25 07:54:52,887 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1289772.0, ans=0.125 2023-06-25 07:55:09,909 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=1289832.0, ans=0.2 2023-06-25 07:55:29,853 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=1289892.0, ans=0.125 2023-06-25 07:55:29,894 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1289892.0, ans=0.0 2023-06-25 07:56:13,566 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1289952.0, ans=0.125 2023-06-25 07:56:27,993 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.max_abs, batch_count=1290012.0, ans=10.0 2023-06-25 07:56:40,533 INFO [train.py:996] (3/4) Epoch 8, batch 1550, loss[loss=0.2355, simple_loss=0.3308, pruned_loss=0.07007, over 20912.00 frames. ], tot_loss[loss=0.2194, simple_loss=0.2962, pruned_loss=0.07128, over 4285882.64 frames. ], batch size: 607, lr: 3.85e-03, grad_scale: 8.0 2023-06-25 07:56:41,273 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1290072.0, ans=0.125 2023-06-25 07:57:17,279 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1290132.0, ans=0.1 2023-06-25 07:57:35,131 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.389e+02 3.681e+02 5.239e+02 6.621e+02 1.108e+03, threshold=1.048e+03, percent-clipped=5.0 2023-06-25 07:58:01,608 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=1290252.0, ans=0.0 2023-06-25 07:58:33,657 INFO [train.py:996] (3/4) Epoch 8, batch 1600, loss[loss=0.2411, simple_loss=0.3495, pruned_loss=0.06634, over 21285.00 frames. ], tot_loss[loss=0.2197, simple_loss=0.2961, pruned_loss=0.07167, over 4285542.69 frames. ], batch size: 548, lr: 3.85e-03, grad_scale: 16.0 2023-06-25 07:58:38,074 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1290372.0, ans=0.125 2023-06-25 07:59:02,394 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 07:59:07,475 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1290432.0, ans=0.125 2023-06-25 07:59:48,411 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=1290492.0, ans=0.0 2023-06-25 08:00:27,009 INFO [train.py:996] (3/4) Epoch 8, batch 1650, loss[loss=0.2599, simple_loss=0.3298, pruned_loss=0.09497, over 21801.00 frames. ], tot_loss[loss=0.2202, simple_loss=0.2967, pruned_loss=0.07189, over 4280977.02 frames. 
], batch size: 124, lr: 3.85e-03, grad_scale: 16.0 2023-06-25 08:01:38,155 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.521e+02 3.337e+02 4.261e+02 5.571e+02 1.006e+03, threshold=8.522e+02, percent-clipped=0.0 2023-06-25 08:02:04,258 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1290912.0, ans=0.0 2023-06-25 08:02:09,609 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1290912.0, ans=0.125 2023-06-25 08:02:20,387 INFO [train.py:996] (3/4) Epoch 8, batch 1700, loss[loss=0.233, simple_loss=0.2951, pruned_loss=0.08546, over 21853.00 frames. ], tot_loss[loss=0.2231, simple_loss=0.3, pruned_loss=0.07304, over 4283644.63 frames. ], batch size: 441, lr: 3.84e-03, grad_scale: 16.0 2023-06-25 08:02:36,593 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1290972.0, ans=0.0 2023-06-25 08:02:45,567 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.44 vs. limit=15.0 2023-06-25 08:03:15,349 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.91 vs. limit=22.5 2023-06-25 08:03:22,108 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=1291092.0, ans=0.0 2023-06-25 08:03:40,019 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=1291152.0, ans=0.0 2023-06-25 08:04:20,212 INFO [train.py:996] (3/4) Epoch 8, batch 1750, loss[loss=0.1731, simple_loss=0.2579, pruned_loss=0.0442, over 21570.00 frames. ], tot_loss[loss=0.2184, simple_loss=0.2963, pruned_loss=0.07029, over 4274990.49 frames. ], batch size: 230, lr: 3.84e-03, grad_scale: 16.0 2023-06-25 08:04:22,623 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1291272.0, ans=0.0 2023-06-25 08:05:09,456 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.16 vs. limit=12.0 2023-06-25 08:05:26,789 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.325e+02 3.271e+02 4.291e+02 6.912e+02 1.295e+03, threshold=8.582e+02, percent-clipped=12.0 2023-06-25 08:06:19,510 INFO [train.py:996] (3/4) Epoch 8, batch 1800, loss[loss=0.1798, simple_loss=0.2468, pruned_loss=0.05636, over 21352.00 frames. ], tot_loss[loss=0.2158, simple_loss=0.2949, pruned_loss=0.06834, over 4270683.45 frames. ], batch size: 211, lr: 3.84e-03, grad_scale: 16.0 2023-06-25 08:07:19,915 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=1291692.0, ans=0.0 2023-06-25 08:07:24,299 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=4.38 vs. limit=15.0 2023-06-25 08:08:10,412 INFO [train.py:996] (3/4) Epoch 8, batch 1850, loss[loss=0.2483, simple_loss=0.3382, pruned_loss=0.07921, over 21368.00 frames. ], tot_loss[loss=0.2152, simple_loss=0.2952, pruned_loss=0.06757, over 4272727.91 frames. 
], batch size: 549, lr: 3.84e-03, grad_scale: 16.0 2023-06-25 08:08:46,749 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=1291932.0, ans=0.0 2023-06-25 08:08:46,807 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1291932.0, ans=0.125 2023-06-25 08:08:52,116 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1291932.0, ans=0.0 2023-06-25 08:09:00,786 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=1291992.0, ans=0.2 2023-06-25 08:09:08,862 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.511e+02 3.967e+02 5.452e+02 7.986e+02 1.937e+03, threshold=1.090e+03, percent-clipped=22.0 2023-06-25 08:09:11,006 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=1291992.0, ans=0.95 2023-06-25 08:09:24,188 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1292052.0, ans=0.0 2023-06-25 08:10:05,924 INFO [train.py:996] (3/4) Epoch 8, batch 1900, loss[loss=0.2278, simple_loss=0.2974, pruned_loss=0.07912, over 21739.00 frames. ], tot_loss[loss=0.2165, simple_loss=0.2964, pruned_loss=0.06832, over 4270770.23 frames. ], batch size: 389, lr: 3.84e-03, grad_scale: 16.0 2023-06-25 08:10:29,693 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.min_abs, batch_count=1292232.0, ans=0.5 2023-06-25 08:10:40,164 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=1292232.0, ans=0.5 2023-06-25 08:11:16,749 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1292352.0, ans=0.1 2023-06-25 08:11:24,283 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1292352.0, ans=0.125 2023-06-25 08:11:35,152 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.89 vs. limit=15.0 2023-06-25 08:12:04,380 INFO [train.py:996] (3/4) Epoch 8, batch 1950, loss[loss=0.2086, simple_loss=0.2858, pruned_loss=0.06568, over 21874.00 frames. ], tot_loss[loss=0.2153, simple_loss=0.2937, pruned_loss=0.06841, over 4275016.02 frames. ], batch size: 373, lr: 3.84e-03, grad_scale: 16.0 2023-06-25 08:12:13,062 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=1292472.0, ans=0.0 2023-06-25 08:12:30,556 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=9.22 vs. limit=15.0 2023-06-25 08:12:31,310 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1292532.0, ans=0.0 2023-06-25 08:12:44,944 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.47 vs. 
limit=15.0 2023-06-25 08:12:47,725 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=1292592.0, ans=0.0 2023-06-25 08:13:00,219 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.618e+02 4.190e+02 5.257e+02 7.093e+02 1.583e+03, threshold=1.051e+03, percent-clipped=6.0 2023-06-25 08:13:52,818 INFO [train.py:996] (3/4) Epoch 8, batch 2000, loss[loss=0.1911, simple_loss=0.2649, pruned_loss=0.05864, over 21844.00 frames. ], tot_loss[loss=0.2125, simple_loss=0.2907, pruned_loss=0.06713, over 4281532.06 frames. ], batch size: 118, lr: 3.84e-03, grad_scale: 32.0 2023-06-25 08:14:07,654 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.99 vs. limit=6.0 2023-06-25 08:14:10,853 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=2.693e-03 2023-06-25 08:14:46,172 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=1292892.0, ans=0.2 2023-06-25 08:15:44,213 INFO [train.py:996] (3/4) Epoch 8, batch 2050, loss[loss=0.2099, simple_loss=0.2999, pruned_loss=0.05998, over 21632.00 frames. ], tot_loss[loss=0.2157, simple_loss=0.2952, pruned_loss=0.06808, over 4289645.51 frames. ], batch size: 263, lr: 3.84e-03, grad_scale: 16.0 2023-06-25 08:16:05,760 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1293132.0, ans=0.125 2023-06-25 08:16:39,107 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.662e+02 4.169e+02 5.197e+02 7.491e+02 1.738e+03, threshold=1.039e+03, percent-clipped=10.0 2023-06-25 08:16:51,896 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1293252.0, ans=0.125 2023-06-25 08:17:35,790 INFO [train.py:996] (3/4) Epoch 8, batch 2100, loss[loss=0.2205, simple_loss=0.3086, pruned_loss=0.06622, over 21877.00 frames. ], tot_loss[loss=0.2181, simple_loss=0.2978, pruned_loss=0.06924, over 4282886.31 frames. ], batch size: 316, lr: 3.84e-03, grad_scale: 16.0 2023-06-25 08:17:47,027 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1293372.0, ans=0.125 2023-06-25 08:18:04,440 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-25 08:18:07,804 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=1293432.0, ans=0.0 2023-06-25 08:18:48,965 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1293552.0, ans=0.1 2023-06-25 08:19:27,066 INFO [train.py:996] (3/4) Epoch 8, batch 2150, loss[loss=0.2083, simple_loss=0.2808, pruned_loss=0.06792, over 21723.00 frames. ], tot_loss[loss=0.2174, simple_loss=0.2953, pruned_loss=0.06971, over 4278112.00 frames. 
], batch size: 351, lr: 3.84e-03, grad_scale: 16.0 2023-06-25 08:19:38,387 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1293672.0, ans=0.1 2023-06-25 08:20:18,175 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1293792.0, ans=0.0 2023-06-25 08:20:23,095 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.526e+02 3.335e+02 3.972e+02 5.687e+02 1.021e+03, threshold=7.943e+02, percent-clipped=0.0 2023-06-25 08:20:28,907 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1293852.0, ans=0.125 2023-06-25 08:20:30,403 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=1293852.0, ans=0.0 2023-06-25 08:21:00,207 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=1293912.0, ans=0.0 2023-06-25 08:21:19,279 INFO [train.py:996] (3/4) Epoch 8, batch 2200, loss[loss=0.2215, simple_loss=0.3115, pruned_loss=0.06571, over 21645.00 frames. ], tot_loss[loss=0.2181, simple_loss=0.2963, pruned_loss=0.06995, over 4267593.69 frames. ], batch size: 389, lr: 3.84e-03, grad_scale: 16.0 2023-06-25 08:21:43,993 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1294032.0, ans=0.1 2023-06-25 08:21:49,669 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=1294032.0, ans=0.0 2023-06-25 08:21:51,915 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.34 vs. limit=22.5 2023-06-25 08:23:08,636 INFO [train.py:996] (3/4) Epoch 8, batch 2250, loss[loss=0.2116, simple_loss=0.2867, pruned_loss=0.06821, over 21613.00 frames. ], tot_loss[loss=0.2149, simple_loss=0.2923, pruned_loss=0.06877, over 4272046.93 frames. ], batch size: 442, lr: 3.84e-03, grad_scale: 16.0 2023-06-25 08:23:54,894 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1294392.0, ans=0.125 2023-06-25 08:23:58,891 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.31 vs. limit=15.0 2023-06-25 08:24:02,843 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.466e+02 3.638e+02 4.452e+02 6.050e+02 1.629e+03, threshold=8.904e+02, percent-clipped=11.0 2023-06-25 08:24:44,741 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=1294512.0, ans=0.025 2023-06-25 08:24:52,815 INFO [train.py:996] (3/4) Epoch 8, batch 2300, loss[loss=0.2237, simple_loss=0.2792, pruned_loss=0.08408, over 21601.00 frames. ], tot_loss[loss=0.2119, simple_loss=0.288, pruned_loss=0.06785, over 4274552.05 frames. 
], batch size: 415, lr: 3.84e-03, grad_scale: 16.0 2023-06-25 08:24:55,114 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1294572.0, ans=0.0 2023-06-25 08:25:05,567 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1294572.0, ans=0.0 2023-06-25 08:25:23,198 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten.whitening_limit, batch_count=1294632.0, ans=15.0 2023-06-25 08:25:30,142 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1294692.0, ans=0.0 2023-06-25 08:26:46,472 INFO [train.py:996] (3/4) Epoch 8, batch 2350, loss[loss=0.2219, simple_loss=0.3132, pruned_loss=0.06529, over 21788.00 frames. ], tot_loss[loss=0.2132, simple_loss=0.2876, pruned_loss=0.06942, over 4266681.95 frames. ], batch size: 351, lr: 3.84e-03, grad_scale: 16.0 2023-06-25 08:27:41,217 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.567e+02 4.172e+02 5.399e+02 7.196e+02 1.286e+03, threshold=1.080e+03, percent-clipped=11.0 2023-06-25 08:27:45,564 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=1295052.0, ans=0.07 2023-06-25 08:28:37,750 INFO [train.py:996] (3/4) Epoch 8, batch 2400, loss[loss=0.2462, simple_loss=0.3189, pruned_loss=0.08671, over 21725.00 frames. ], tot_loss[loss=0.2186, simple_loss=0.2931, pruned_loss=0.07207, over 4269640.12 frames. ], batch size: 298, lr: 3.84e-03, grad_scale: 32.0 2023-06-25 08:28:45,062 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1295172.0, ans=0.0 2023-06-25 08:28:55,846 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=1295232.0, ans=0.125 2023-06-25 08:29:23,003 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1295292.0, ans=0.0 2023-06-25 08:30:27,363 INFO [train.py:996] (3/4) Epoch 8, batch 2450, loss[loss=0.2004, simple_loss=0.3173, pruned_loss=0.04173, over 20793.00 frames. ], tot_loss[loss=0.2213, simple_loss=0.2957, pruned_loss=0.07345, over 4271899.81 frames. ], batch size: 608, lr: 3.84e-03, grad_scale: 16.0 2023-06-25 08:31:21,992 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1295592.0, ans=0.125 2023-06-25 08:31:24,807 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.579e+02 3.841e+02 6.208e+02 9.164e+02 1.809e+03, threshold=1.242e+03, percent-clipped=16.0 2023-06-25 08:32:12,771 INFO [train.py:996] (3/4) Epoch 8, batch 2500, loss[loss=0.2022, simple_loss=0.294, pruned_loss=0.05523, over 21514.00 frames. ], tot_loss[loss=0.2205, simple_loss=0.2964, pruned_loss=0.07231, over 4274953.98 frames. ], batch size: 389, lr: 3.84e-03, grad_scale: 16.0 2023-06-25 08:33:18,623 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1295952.0, ans=0.0 2023-06-25 08:33:47,406 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=1296012.0, ans=0.125 2023-06-25 08:33:59,262 INFO [train.py:996] (3/4) Epoch 8, batch 2550, loss[loss=0.204, simple_loss=0.2702, pruned_loss=0.06888, over 21818.00 frames. 
], tot_loss[loss=0.2184, simple_loss=0.2938, pruned_loss=0.07155, over 4271578.76 frames. ], batch size: 317, lr: 3.84e-03, grad_scale: 16.0 2023-06-25 08:34:10,339 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1296072.0, ans=0.1 2023-06-25 08:34:46,664 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=9.88 vs. limit=15.0 2023-06-25 08:34:56,193 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.632e+02 3.347e+02 3.968e+02 6.148e+02 1.129e+03, threshold=7.936e+02, percent-clipped=0.0 2023-06-25 08:35:31,314 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1296312.0, ans=0.0 2023-06-25 08:35:40,075 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1296312.0, ans=0.125 2023-06-25 08:35:41,694 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1296312.0, ans=0.125 2023-06-25 08:35:49,662 INFO [train.py:996] (3/4) Epoch 8, batch 2600, loss[loss=0.2644, simple_loss=0.3312, pruned_loss=0.09878, over 21794.00 frames. ], tot_loss[loss=0.2202, simple_loss=0.2947, pruned_loss=0.07288, over 4273063.65 frames. ], batch size: 441, lr: 3.84e-03, grad_scale: 16.0 2023-06-25 08:36:07,714 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1296432.0, ans=0.1 2023-06-25 08:37:40,602 INFO [train.py:996] (3/4) Epoch 8, batch 2650, loss[loss=0.2298, simple_loss=0.3049, pruned_loss=0.07729, over 21615.00 frames. ], tot_loss[loss=0.2217, simple_loss=0.2957, pruned_loss=0.07389, over 4279053.71 frames. ], batch size: 131, lr: 3.84e-03, grad_scale: 16.0 2023-06-25 08:37:42,989 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1296672.0, ans=0.125 2023-06-25 08:38:00,866 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 08:38:37,353 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.620e+02 3.828e+02 4.857e+02 7.020e+02 1.360e+03, threshold=9.714e+02, percent-clipped=21.0 2023-06-25 08:39:13,120 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1296912.0, ans=0.125 2023-06-25 08:39:24,880 INFO [train.py:996] (3/4) Epoch 8, batch 2700, loss[loss=0.1358, simple_loss=0.1875, pruned_loss=0.04208, over 16283.00 frames. ], tot_loss[loss=0.2204, simple_loss=0.295, pruned_loss=0.07288, over 4269541.95 frames. ], batch size: 61, lr: 3.84e-03, grad_scale: 16.0 2023-06-25 08:39:52,511 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1297032.0, ans=0.0 2023-06-25 08:41:11,705 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=1297212.0, ans=0.0 2023-06-25 08:41:17,917 INFO [train.py:996] (3/4) Epoch 8, batch 2750, loss[loss=0.208, simple_loss=0.2805, pruned_loss=0.06776, over 21826.00 frames. ], tot_loss[loss=0.2203, simple_loss=0.2959, pruned_loss=0.07238, over 4266166.22 frames. 
], batch size: 298, lr: 3.84e-03, grad_scale: 16.0 2023-06-25 08:41:24,449 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1297272.0, ans=0.0 2023-06-25 08:42:27,791 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.672e+02 4.055e+02 5.362e+02 7.595e+02 1.481e+03, threshold=1.072e+03, percent-clipped=12.0 2023-06-25 08:43:11,595 INFO [train.py:996] (3/4) Epoch 8, batch 2800, loss[loss=0.2218, simple_loss=0.29, pruned_loss=0.07683, over 21365.00 frames. ], tot_loss[loss=0.2254, simple_loss=0.302, pruned_loss=0.07435, over 4274283.08 frames. ], batch size: 549, lr: 3.83e-03, grad_scale: 32.0 2023-06-25 08:43:15,878 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=1297572.0, ans=0.0 2023-06-25 08:43:38,112 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1297632.0, ans=0.125 2023-06-25 08:43:41,490 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=1297632.0, ans=0.125 2023-06-25 08:43:50,930 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1297632.0, ans=0.125 2023-06-25 08:43:54,742 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=1297692.0, ans=0.125 2023-06-25 08:44:13,296 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=1297692.0, ans=0.2 2023-06-25 08:44:58,987 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1297872.0, ans=0.0 2023-06-25 08:44:59,917 INFO [train.py:996] (3/4) Epoch 8, batch 2850, loss[loss=0.2081, simple_loss=0.2909, pruned_loss=0.06265, over 21759.00 frames. ], tot_loss[loss=0.2283, simple_loss=0.3049, pruned_loss=0.0759, over 4278995.71 frames. ], batch size: 351, lr: 3.83e-03, grad_scale: 32.0 2023-06-25 08:45:22,170 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=1297932.0, ans=0.2 2023-06-25 08:45:27,491 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=12.59 vs. limit=22.5 2023-06-25 08:46:13,111 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.766e+02 3.662e+02 5.066e+02 7.139e+02 1.545e+03, threshold=1.013e+03, percent-clipped=5.0 2023-06-25 08:46:14,430 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.50 vs. limit=6.0 2023-06-25 08:46:17,660 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=1298052.0, ans=0.2 2023-06-25 08:46:50,070 INFO [train.py:996] (3/4) Epoch 8, batch 2900, loss[loss=0.2277, simple_loss=0.2937, pruned_loss=0.08085, over 21482.00 frames. ], tot_loss[loss=0.2243, simple_loss=0.2993, pruned_loss=0.07468, over 4274838.92 frames. 
], batch size: 194, lr: 3.83e-03, grad_scale: 16.0 2023-06-25 08:47:20,612 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1298232.0, ans=0.0 2023-06-25 08:48:20,098 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=1298352.0, ans=0.125 2023-06-25 08:48:40,979 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=1298472.0, ans=0.0 2023-06-25 08:48:42,073 INFO [train.py:996] (3/4) Epoch 8, batch 2950, loss[loss=0.2096, simple_loss=0.3014, pruned_loss=0.05892, over 21592.00 frames. ], tot_loss[loss=0.224, simple_loss=0.2998, pruned_loss=0.0741, over 4281723.17 frames. ], batch size: 230, lr: 3.83e-03, grad_scale: 16.0 2023-06-25 08:48:55,531 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1298472.0, ans=0.125 2023-06-25 08:49:00,595 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=1298472.0, ans=0.0 2023-06-25 08:49:57,026 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.695e+02 3.497e+02 4.851e+02 7.009e+02 1.350e+03, threshold=9.702e+02, percent-clipped=11.0 2023-06-25 08:50:33,448 INFO [train.py:996] (3/4) Epoch 8, batch 3000, loss[loss=0.2518, simple_loss=0.3246, pruned_loss=0.08946, over 21546.00 frames. ], tot_loss[loss=0.2274, simple_loss=0.3046, pruned_loss=0.07514, over 4288365.34 frames. ], batch size: 131, lr: 3.83e-03, grad_scale: 16.0 2023-06-25 08:50:33,448 INFO [train.py:1019] (3/4) Computing validation loss 2023-06-25 08:50:54,964 INFO [train.py:1028] (3/4) Epoch 8, validation: loss=0.2557, simple_loss=0.3462, pruned_loss=0.08265, over 1796401.00 frames. 2023-06-25 08:50:54,965 INFO [train.py:1029] (3/4) Maximum memory allocated so far is 23654MB 2023-06-25 08:52:06,564 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=1298952.0, ans=0.0 2023-06-25 08:52:10,586 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1298952.0, ans=0.125 2023-06-25 08:52:45,443 INFO [train.py:996] (3/4) Epoch 8, batch 3050, loss[loss=0.1896, simple_loss=0.2653, pruned_loss=0.05696, over 21511.00 frames. ], tot_loss[loss=0.225, simple_loss=0.3031, pruned_loss=0.07348, over 4283627.35 frames. ], batch size: 194, lr: 3.83e-03, grad_scale: 16.0 2023-06-25 08:52:56,837 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=1299072.0, ans=0.125 2023-06-25 08:53:55,758 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.393e+02 3.327e+02 3.997e+02 5.438e+02 1.383e+03, threshold=7.994e+02, percent-clipped=4.0 2023-06-25 08:54:35,989 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=1299372.0, ans=0.125 2023-06-25 08:54:37,055 INFO [train.py:996] (3/4) Epoch 8, batch 3100, loss[loss=0.2048, simple_loss=0.29, pruned_loss=0.05974, over 21706.00 frames. ], tot_loss[loss=0.2251, simple_loss=0.304, pruned_loss=0.07313, over 4284889.79 frames. ], batch size: 247, lr: 3.83e-03, grad_scale: 16.0 2023-06-25 08:55:31,276 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=14.39 vs. 
limit=22.5 2023-06-25 08:56:39,318 INFO [train.py:996] (3/4) Epoch 8, batch 3150, loss[loss=0.2328, simple_loss=0.3169, pruned_loss=0.0743, over 21729.00 frames. ], tot_loss[loss=0.2258, simple_loss=0.3051, pruned_loss=0.07329, over 4284777.58 frames. ], batch size: 298, lr: 3.83e-03, grad_scale: 16.0 2023-06-25 08:56:45,848 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1299672.0, ans=0.0 2023-06-25 08:56:58,256 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.min_abs, batch_count=1299672.0, ans=0.5 2023-06-25 08:57:31,338 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=9.93 vs. limit=15.0 2023-06-25 08:57:44,758 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.490e+02 3.426e+02 4.350e+02 5.969e+02 1.538e+03, threshold=8.700e+02, percent-clipped=12.0 2023-06-25 08:58:26,914 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1299912.0, ans=0.1 2023-06-25 08:58:36,655 INFO [train.py:996] (3/4) Epoch 8, batch 3200, loss[loss=0.232, simple_loss=0.3068, pruned_loss=0.07861, over 21340.00 frames. ], tot_loss[loss=0.2277, simple_loss=0.3069, pruned_loss=0.07424, over 4280110.32 frames. ], batch size: 176, lr: 3.83e-03, grad_scale: 32.0 2023-06-25 08:59:33,838 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.32 vs. limit=15.0 2023-06-25 09:00:01,453 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.73 vs. limit=15.0 2023-06-25 09:00:23,925 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.39 vs. limit=15.0 2023-06-25 09:00:28,350 INFO [train.py:996] (3/4) Epoch 8, batch 3250, loss[loss=0.215, simple_loss=0.276, pruned_loss=0.07695, over 21683.00 frames. ], tot_loss[loss=0.2302, simple_loss=0.308, pruned_loss=0.0762, over 4282056.48 frames. ], batch size: 333, lr: 3.83e-03, grad_scale: 32.0 2023-06-25 09:01:07,408 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1300332.0, ans=0.125 2023-06-25 09:01:09,150 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=1300392.0, ans=0.2 2023-06-25 09:01:20,037 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1300392.0, ans=0.125 2023-06-25 09:01:27,306 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.59 vs. limit=15.0 2023-06-25 09:01:30,047 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.648e+02 3.942e+02 5.285e+02 9.066e+02 2.066e+03, threshold=1.057e+03, percent-clipped=29.0 2023-06-25 09:02:20,247 INFO [train.py:996] (3/4) Epoch 8, batch 3300, loss[loss=0.2085, simple_loss=0.3022, pruned_loss=0.05738, over 21664.00 frames. ], tot_loss[loss=0.2285, simple_loss=0.3059, pruned_loss=0.07559, over 4267619.73 frames. 
], batch size: 298, lr: 3.83e-03, grad_scale: 16.0 2023-06-25 09:02:23,430 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.08 vs. limit=10.0 2023-06-25 09:02:40,170 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 09:02:53,578 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=1300632.0, ans=0.0 2023-06-25 09:03:00,504 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=1300692.0, ans=0.125 2023-06-25 09:04:03,726 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=1300812.0, ans=0.05 2023-06-25 09:04:11,674 INFO [train.py:996] (3/4) Epoch 8, batch 3350, loss[loss=0.217, simple_loss=0.2911, pruned_loss=0.07148, over 21384.00 frames. ], tot_loss[loss=0.2296, simple_loss=0.3071, pruned_loss=0.07605, over 4274809.50 frames. ], batch size: 131, lr: 3.83e-03, grad_scale: 16.0 2023-06-25 09:04:20,313 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.22 vs. limit=6.0 2023-06-25 09:04:28,216 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=1300932.0, ans=0.0 2023-06-25 09:04:59,066 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1300992.0, ans=0.125 2023-06-25 09:05:18,250 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.33 vs. limit=15.0 2023-06-25 09:05:23,410 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.477e+02 4.026e+02 5.637e+02 8.126e+02 1.843e+03, threshold=1.127e+03, percent-clipped=12.0 2023-06-25 09:05:25,677 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=1301052.0, ans=0.05 2023-06-25 09:05:45,009 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1301112.0, ans=0.0 2023-06-25 09:06:01,859 INFO [train.py:996] (3/4) Epoch 8, batch 3400, loss[loss=0.194, simple_loss=0.2709, pruned_loss=0.05855, over 21663.00 frames. ], tot_loss[loss=0.2279, simple_loss=0.3051, pruned_loss=0.07533, over 4277723.16 frames. ], batch size: 247, lr: 3.83e-03, grad_scale: 16.0 2023-06-25 09:06:39,452 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1301232.0, ans=0.125 2023-06-25 09:06:39,919 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.39 vs. limit=15.0 2023-06-25 09:07:55,852 INFO [train.py:996] (3/4) Epoch 8, batch 3450, loss[loss=0.246, simple_loss=0.3109, pruned_loss=0.0905, over 21820.00 frames. ], tot_loss[loss=0.2256, simple_loss=0.3015, pruned_loss=0.07484, over 4282833.80 frames. ], batch size: 441, lr: 3.83e-03, grad_scale: 16.0 2023-06-25 09:08:11,343 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=12.58 vs. 
limit=15.0 2023-06-25 09:08:57,106 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1301592.0, ans=0.1 2023-06-25 09:09:13,935 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.637e+02 3.588e+02 4.974e+02 7.725e+02 1.763e+03, threshold=9.948e+02, percent-clipped=11.0 2023-06-25 09:09:53,619 INFO [train.py:996] (3/4) Epoch 8, batch 3500, loss[loss=0.2578, simple_loss=0.329, pruned_loss=0.09328, over 21256.00 frames. ], tot_loss[loss=0.234, simple_loss=0.3106, pruned_loss=0.07873, over 4285494.46 frames. ], batch size: 159, lr: 3.83e-03, grad_scale: 16.0 2023-06-25 09:10:06,759 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=1301772.0, ans=0.0 2023-06-25 09:10:24,344 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.19 vs. limit=15.0 2023-06-25 09:11:43,838 INFO [train.py:996] (3/4) Epoch 8, batch 3550, loss[loss=0.1853, simple_loss=0.2445, pruned_loss=0.06311, over 19857.00 frames. ], tot_loss[loss=0.2364, simple_loss=0.3127, pruned_loss=0.08001, over 4285339.95 frames. ], batch size: 703, lr: 3.83e-03, grad_scale: 16.0 2023-06-25 09:12:12,302 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1302132.0, ans=0.125 2023-06-25 09:12:24,755 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1302132.0, ans=0.1 2023-06-25 09:12:27,051 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1302132.0, ans=0.1 2023-06-25 09:12:42,411 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten.whitening_limit, batch_count=1302192.0, ans=15.0 2023-06-25 09:12:55,364 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.560e+02 4.003e+02 5.383e+02 7.230e+02 1.174e+03, threshold=1.077e+03, percent-clipped=7.0 2023-06-25 09:13:02,294 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.20 vs. limit=6.0 2023-06-25 09:13:02,574 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=5.95 vs. limit=15.0 2023-06-25 09:13:32,375 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1302312.0, ans=0.1 2023-06-25 09:13:35,412 INFO [train.py:996] (3/4) Epoch 8, batch 3600, loss[loss=0.2237, simple_loss=0.2821, pruned_loss=0.08264, over 21842.00 frames. ], tot_loss[loss=0.2319, simple_loss=0.3057, pruned_loss=0.07903, over 4285850.49 frames. ], batch size: 98, lr: 3.83e-03, grad_scale: 32.0 2023-06-25 09:13:57,382 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=1302432.0, ans=0.125 2023-06-25 09:14:05,296 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.84 vs. 
limit=6.0 2023-06-25 09:14:13,579 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=1302432.0, ans=0.07 2023-06-25 09:15:18,541 INFO [train.py:996] (3/4) Epoch 8, batch 3650, loss[loss=0.2062, simple_loss=0.2921, pruned_loss=0.06016, over 21778.00 frames. ], tot_loss[loss=0.2303, simple_loss=0.3046, pruned_loss=0.07806, over 4283439.20 frames. ], batch size: 332, lr: 3.83e-03, grad_scale: 32.0 2023-06-25 09:15:23,169 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.67 vs. limit=22.5 2023-06-25 09:16:00,552 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1302732.0, ans=0.125 2023-06-25 09:16:10,861 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1302792.0, ans=0.0 2023-06-25 09:16:31,573 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.652e+02 4.088e+02 5.545e+02 7.819e+02 1.547e+03, threshold=1.109e+03, percent-clipped=4.0 2023-06-25 09:16:44,208 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=1302852.0, ans=0.125 2023-06-25 09:16:53,109 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=1302912.0, ans=0.0 2023-06-25 09:17:09,592 INFO [train.py:996] (3/4) Epoch 8, batch 3700, loss[loss=0.2242, simple_loss=0.2977, pruned_loss=0.07534, over 21804.00 frames. ], tot_loss[loss=0.2284, simple_loss=0.3025, pruned_loss=0.07713, over 4293271.72 frames. ], batch size: 107, lr: 3.83e-03, grad_scale: 16.0 2023-06-25 09:17:34,589 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=1303032.0, ans=0.125 2023-06-25 09:17:55,198 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1303092.0, ans=0.125 2023-06-25 09:18:14,784 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1303092.0, ans=0.125 2023-06-25 09:18:24,006 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1303152.0, ans=0.125 2023-06-25 09:18:27,513 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=1303152.0, ans=0.125 2023-06-25 09:19:01,404 INFO [train.py:996] (3/4) Epoch 8, batch 3750, loss[loss=0.1915, simple_loss=0.2681, pruned_loss=0.05749, over 21823.00 frames. ], tot_loss[loss=0.2264, simple_loss=0.3008, pruned_loss=0.076, over 4291708.95 frames. ], batch size: 298, lr: 3.83e-03, grad_scale: 16.0 2023-06-25 09:19:17,029 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1303272.0, ans=0.125 2023-06-25 09:19:38,780 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=11.98 vs. 
limit=15.0 2023-06-25 09:20:09,082 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1303392.0, ans=0.125 2023-06-25 09:20:21,197 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.521e+02 3.272e+02 4.501e+02 6.560e+02 9.292e+02, threshold=9.001e+02, percent-clipped=0.0 2023-06-25 09:20:58,387 INFO [train.py:996] (3/4) Epoch 8, batch 3800, loss[loss=0.2524, simple_loss=0.3254, pruned_loss=0.08964, over 21559.00 frames. ], tot_loss[loss=0.2225, simple_loss=0.2975, pruned_loss=0.07372, over 4285547.75 frames. ], batch size: 389, lr: 3.83e-03, grad_scale: 16.0 2023-06-25 09:21:01,547 INFO [scaling.py:962] (3/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=3.94 vs. limit=5.0 2023-06-25 09:21:35,089 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.90 vs. limit=15.0 2023-06-25 09:21:40,402 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1303632.0, ans=0.125 2023-06-25 09:21:52,847 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1.whitening_limit, batch_count=1303692.0, ans=10.0 2023-06-25 09:21:52,996 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=5.95 vs. limit=10.0 2023-06-25 09:21:53,977 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 09:22:07,366 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=1303752.0, ans=0.04949747468305833 2023-06-25 09:22:40,690 INFO [train.py:996] (3/4) Epoch 8, batch 3850, loss[loss=0.1902, simple_loss=0.255, pruned_loss=0.06269, over 21635.00 frames. ], tot_loss[loss=0.2221, simple_loss=0.2954, pruned_loss=0.07444, over 4286455.32 frames. ], batch size: 298, lr: 3.83e-03, grad_scale: 16.0 2023-06-25 09:23:15,357 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1303932.0, ans=0.125 2023-06-25 09:23:17,637 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=1303932.0, ans=10.0 2023-06-25 09:23:39,386 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1303992.0, ans=0.125 2023-06-25 09:23:43,185 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.35 vs. limit=22.5 2023-06-25 09:23:52,004 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.09 vs. limit=22.5 2023-06-25 09:23:59,531 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.585e+02 3.373e+02 4.487e+02 6.167e+02 2.000e+03, threshold=8.974e+02, percent-clipped=6.0 2023-06-25 09:24:02,108 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-25 09:24:31,273 INFO [train.py:996] (3/4) Epoch 8, batch 3900, loss[loss=0.2132, simple_loss=0.281, pruned_loss=0.07268, over 21867.00 frames. 
], tot_loss[loss=0.2201, simple_loss=0.292, pruned_loss=0.07409, over 4277723.17 frames. ], batch size: 371, lr: 3.83e-03, grad_scale: 16.0 2023-06-25 09:25:30,985 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=1304292.0, ans=0.2 2023-06-25 09:26:18,710 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1304412.0, ans=0.125 2023-06-25 09:26:27,160 INFO [train.py:996] (3/4) Epoch 8, batch 3950, loss[loss=0.2255, simple_loss=0.3135, pruned_loss=0.06877, over 21786.00 frames. ], tot_loss[loss=0.2206, simple_loss=0.2944, pruned_loss=0.07341, over 4277486.68 frames. ], batch size: 282, lr: 3.82e-03, grad_scale: 16.0 2023-06-25 09:27:02,263 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=1304532.0, ans=0.2 2023-06-25 09:27:17,790 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.28 vs. limit=22.5 2023-06-25 09:27:38,821 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.574e+02 3.686e+02 5.186e+02 7.402e+02 1.424e+03, threshold=1.037e+03, percent-clipped=9.0 2023-06-25 09:28:16,203 INFO [train.py:996] (3/4) Epoch 8, batch 4000, loss[loss=0.2063, simple_loss=0.2624, pruned_loss=0.07509, over 21903.00 frames. ], tot_loss[loss=0.2135, simple_loss=0.2877, pruned_loss=0.06967, over 4277051.30 frames. ], batch size: 113, lr: 3.82e-03, grad_scale: 32.0 2023-06-25 09:30:11,496 INFO [train.py:996] (3/4) Epoch 8, batch 4050, loss[loss=0.2017, simple_loss=0.2856, pruned_loss=0.05892, over 21792.00 frames. ], tot_loss[loss=0.212, simple_loss=0.2872, pruned_loss=0.06834, over 4276107.77 frames. ], batch size: 332, lr: 3.82e-03, grad_scale: 32.0 2023-06-25 09:31:14,893 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.78 vs. limit=22.5 2023-06-25 09:31:18,960 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.508e+02 3.803e+02 4.888e+02 6.657e+02 1.371e+03, threshold=9.776e+02, percent-clipped=4.0 2023-06-25 09:31:43,933 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1305312.0, ans=0.125 2023-06-25 09:31:59,980 INFO [train.py:996] (3/4) Epoch 8, batch 4100, loss[loss=0.2042, simple_loss=0.2905, pruned_loss=0.05902, over 21783.00 frames. ], tot_loss[loss=0.2141, simple_loss=0.2893, pruned_loss=0.06941, over 4281979.21 frames. ], batch size: 332, lr: 3.82e-03, grad_scale: 16.0 2023-06-25 09:32:27,327 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1305432.0, ans=0.125 2023-06-25 09:32:34,034 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=1305432.0, ans=0.125 2023-06-25 09:32:36,574 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.69 vs. 
limit=15.0 2023-06-25 09:32:52,088 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-25 09:33:12,029 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=1305552.0, ans=0.125 2023-06-25 09:33:31,690 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten.whitening_limit, batch_count=1305612.0, ans=15.0 2023-06-25 09:33:48,822 INFO [train.py:996] (3/4) Epoch 8, batch 4150, loss[loss=0.2101, simple_loss=0.2917, pruned_loss=0.06421, over 21593.00 frames. ], tot_loss[loss=0.2112, simple_loss=0.2895, pruned_loss=0.06644, over 4289971.66 frames. ], batch size: 414, lr: 3.82e-03, grad_scale: 8.0 2023-06-25 09:34:10,536 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1305732.0, ans=0.125 2023-06-25 09:34:17,061 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=5.08 vs. limit=15.0 2023-06-25 09:35:00,807 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.587e+02 3.172e+02 3.844e+02 5.295e+02 7.953e+02, threshold=7.689e+02, percent-clipped=0.0 2023-06-25 09:35:37,702 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1305912.0, ans=0.0 2023-06-25 09:35:41,057 INFO [train.py:996] (3/4) Epoch 8, batch 4200, loss[loss=0.2962, simple_loss=0.3793, pruned_loss=0.1065, over 21453.00 frames. ], tot_loss[loss=0.2129, simple_loss=0.2913, pruned_loss=0.06727, over 4287495.82 frames. ], batch size: 471, lr: 3.82e-03, grad_scale: 8.0 2023-06-25 09:35:41,835 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1305972.0, ans=0.125 2023-06-25 09:37:10,334 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=1306152.0, ans=0.125 2023-06-25 09:37:15,799 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=1306212.0, ans=0.125 2023-06-25 09:37:38,030 INFO [train.py:996] (3/4) Epoch 8, batch 4250, loss[loss=0.252, simple_loss=0.3338, pruned_loss=0.08512, over 21755.00 frames. ], tot_loss[loss=0.217, simple_loss=0.2969, pruned_loss=0.06858, over 4274238.29 frames. ], batch size: 124, lr: 3.82e-03, grad_scale: 8.0 2023-06-25 09:38:11,804 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1306332.0, ans=0.125 2023-06-25 09:38:22,345 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=1306392.0, ans=0.025 2023-06-25 09:38:57,638 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.607e+02 4.053e+02 6.185e+02 8.917e+02 1.733e+03, threshold=1.237e+03, percent-clipped=33.0 2023-06-25 09:39:38,305 INFO [train.py:996] (3/4) Epoch 8, batch 4300, loss[loss=0.2373, simple_loss=0.3466, pruned_loss=0.06399, over 21241.00 frames. ], tot_loss[loss=0.2249, simple_loss=0.3063, pruned_loss=0.0718, over 4271790.03 frames. 
], batch size: 548, lr: 3.82e-03, grad_scale: 8.0 2023-06-25 09:40:36,182 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1306692.0, ans=0.1 2023-06-25 09:41:28,175 INFO [train.py:996] (3/4) Epoch 8, batch 4350, loss[loss=0.1837, simple_loss=0.2467, pruned_loss=0.06033, over 21595.00 frames. ], tot_loss[loss=0.2228, simple_loss=0.3043, pruned_loss=0.07068, over 4271228.08 frames. ], batch size: 231, lr: 3.82e-03, grad_scale: 8.0 2023-06-25 09:41:58,868 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1306932.0, ans=0.0 2023-06-25 09:42:03,724 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1306932.0, ans=0.125 2023-06-25 09:42:07,711 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=1306992.0, ans=0.2 2023-06-25 09:42:44,506 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.350e+02 3.580e+02 4.513e+02 6.539e+02 1.169e+03, threshold=9.025e+02, percent-clipped=0.0 2023-06-25 09:42:52,454 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=1307052.0, ans=0.0 2023-06-25 09:43:19,205 INFO [train.py:996] (3/4) Epoch 8, batch 4400, loss[loss=0.2168, simple_loss=0.2802, pruned_loss=0.07676, over 21148.00 frames. ], tot_loss[loss=0.2195, simple_loss=0.2988, pruned_loss=0.07013, over 4265559.27 frames. ], batch size: 143, lr: 3.82e-03, grad_scale: 16.0 2023-06-25 09:43:32,565 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1307172.0, ans=0.125 2023-06-25 09:43:36,517 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.32 vs. limit=22.5 2023-06-25 09:43:50,550 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=1307232.0, ans=0.2 2023-06-25 09:45:05,730 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1307412.0, ans=0.125 2023-06-25 09:45:16,018 INFO [train.py:996] (3/4) Epoch 8, batch 4450, loss[loss=0.2495, simple_loss=0.3459, pruned_loss=0.07656, over 21649.00 frames. ], tot_loss[loss=0.2245, simple_loss=0.3063, pruned_loss=0.07131, over 4254214.35 frames. ], batch size: 263, lr: 3.82e-03, grad_scale: 16.0 2023-06-25 09:45:50,504 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1307532.0, ans=0.125 2023-06-25 09:46:12,099 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=1307592.0, ans=0.2 2023-06-25 09:46:32,170 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.688e+02 3.788e+02 5.957e+02 8.951e+02 1.705e+03, threshold=1.191e+03, percent-clipped=23.0 2023-06-25 09:46:33,366 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=8.57 vs. 
limit=15.0 2023-06-25 09:46:34,716 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1307652.0, ans=0.125 2023-06-25 09:46:55,813 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1307712.0, ans=0.125 2023-06-25 09:46:55,843 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1307712.0, ans=0.125 2023-06-25 09:47:06,080 INFO [train.py:996] (3/4) Epoch 8, batch 4500, loss[loss=0.2624, simple_loss=0.3409, pruned_loss=0.09197, over 21732.00 frames. ], tot_loss[loss=0.2256, simple_loss=0.3063, pruned_loss=0.07243, over 4263125.86 frames. ], batch size: 441, lr: 3.82e-03, grad_scale: 16.0 2023-06-25 09:47:19,328 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1307772.0, ans=0.125 2023-06-25 09:47:21,290 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1307772.0, ans=0.1 2023-06-25 09:47:45,059 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1307832.0, ans=0.0 2023-06-25 09:48:01,853 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.74 vs. limit=10.0 2023-06-25 09:48:22,605 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1307952.0, ans=0.125 2023-06-25 09:48:56,026 INFO [train.py:996] (3/4) Epoch 8, batch 4550, loss[loss=0.2958, simple_loss=0.36, pruned_loss=0.1158, over 21323.00 frames. ], tot_loss[loss=0.2269, simple_loss=0.3083, pruned_loss=0.07273, over 4268636.80 frames. ], batch size: 507, lr: 3.82e-03, grad_scale: 16.0 2023-06-25 09:49:59,464 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 09:50:18,031 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.595e+02 3.343e+02 4.134e+02 5.307e+02 1.038e+03, threshold=8.269e+02, percent-clipped=0.0 2023-06-25 09:50:45,576 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1308312.0, ans=0.125 2023-06-25 09:50:52,056 INFO [train.py:996] (3/4) Epoch 8, batch 4600, loss[loss=0.2264, simple_loss=0.2919, pruned_loss=0.0805, over 21165.00 frames. ], tot_loss[loss=0.2307, simple_loss=0.3115, pruned_loss=0.07494, over 4271720.91 frames. 
], batch size: 608, lr: 3.82e-03, grad_scale: 16.0 2023-06-25 09:51:14,432 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=1308432.0, ans=0.2 2023-06-25 09:51:35,552 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=1308432.0, ans=0.0 2023-06-25 09:51:51,588 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=1308492.0, ans=0.125 2023-06-25 09:52:12,288 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=1308552.0, ans=0.125 2023-06-25 09:52:42,569 INFO [train.py:996] (3/4) Epoch 8, batch 4650, loss[loss=0.2337, simple_loss=0.2946, pruned_loss=0.08637, over 21726.00 frames. ], tot_loss[loss=0.2274, simple_loss=0.3068, pruned_loss=0.07398, over 4280301.34 frames. ], batch size: 441, lr: 3.82e-03, grad_scale: 8.0 2023-06-25 09:53:33,865 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=1308792.0, ans=0.125 2023-06-25 09:53:34,262 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.35 vs. limit=15.0 2023-06-25 09:53:42,499 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=1308792.0, ans=0.0 2023-06-25 09:53:42,564 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1308792.0, ans=0.125 2023-06-25 09:53:47,446 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=1308792.0, ans=0.0 2023-06-25 09:53:59,081 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.293e+02 3.213e+02 3.806e+02 5.357e+02 1.908e+03, threshold=7.612e+02, percent-clipped=10.0 2023-06-25 09:54:03,082 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=1308852.0, ans=0.025 2023-06-25 09:54:31,183 INFO [train.py:996] (3/4) Epoch 8, batch 4700, loss[loss=0.1865, simple_loss=0.2528, pruned_loss=0.06007, over 21691.00 frames. ], tot_loss[loss=0.2195, simple_loss=0.2962, pruned_loss=0.07136, over 4278900.73 frames. ], batch size: 282, lr: 3.82e-03, grad_scale: 8.0 2023-06-25 09:54:49,091 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 09:55:07,375 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=1309032.0, ans=0.07 2023-06-25 09:55:20,359 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=6.97 vs. 
limit=15.0 2023-06-25 09:55:33,441 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=1309092.0, ans=0.0 2023-06-25 09:55:48,983 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1309152.0, ans=0.125 2023-06-25 09:55:53,030 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1309152.0, ans=0.125 2023-06-25 09:56:21,239 INFO [train.py:996] (3/4) Epoch 8, batch 4750, loss[loss=0.2023, simple_loss=0.2705, pruned_loss=0.06706, over 21292.00 frames. ], tot_loss[loss=0.2161, simple_loss=0.2899, pruned_loss=0.07112, over 4281984.62 frames. ], batch size: 159, lr: 3.82e-03, grad_scale: 8.0 2023-06-25 09:56:38,024 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.61 vs. limit=15.0 2023-06-25 09:57:39,339 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.746e+02 3.551e+02 4.538e+02 6.106e+02 1.235e+03, threshold=9.075e+02, percent-clipped=15.0 2023-06-25 09:58:00,046 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1309512.0, ans=0.1 2023-06-25 09:58:10,270 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1309572.0, ans=0.125 2023-06-25 09:58:17,094 INFO [train.py:996] (3/4) Epoch 8, batch 4800, loss[loss=0.2652, simple_loss=0.3576, pruned_loss=0.08638, over 21532.00 frames. ], tot_loss[loss=0.218, simple_loss=0.2911, pruned_loss=0.07249, over 4285429.62 frames. ], batch size: 471, lr: 3.82e-03, grad_scale: 16.0 2023-06-25 09:58:48,912 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.78 vs. limit=15.0 2023-06-25 09:58:50,267 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1309632.0, ans=0.0 2023-06-25 09:59:18,031 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1309752.0, ans=0.125 2023-06-25 09:59:21,307 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1309752.0, ans=0.125 2023-06-25 09:59:59,466 INFO [train.py:996] (3/4) Epoch 8, batch 4850, loss[loss=0.2823, simple_loss=0.3336, pruned_loss=0.1155, over 21637.00 frames. ], tot_loss[loss=0.2162, simple_loss=0.2905, pruned_loss=0.07092, over 4278274.02 frames. ], batch size: 507, lr: 3.82e-03, grad_scale: 16.0 2023-06-25 10:00:33,123 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=1309932.0, ans=0.2 2023-06-25 10:00:46,978 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=1309992.0, ans=0.2 2023-06-25 10:01:16,357 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.775e+02 3.669e+02 4.660e+02 6.748e+02 1.065e+03, threshold=9.320e+02, percent-clipped=5.0 2023-06-25 10:01:48,357 INFO [train.py:996] (3/4) Epoch 8, batch 4900, loss[loss=0.2347, simple_loss=0.333, pruned_loss=0.0682, over 21721.00 frames. ], tot_loss[loss=0.2177, simple_loss=0.2921, pruned_loss=0.07166, over 4285099.96 frames. 
], batch size: 351, lr: 3.82e-03, grad_scale: 16.0 2023-06-25 10:03:04,230 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.08 vs. limit=15.0 2023-06-25 10:03:47,912 INFO [train.py:996] (3/4) Epoch 8, batch 4950, loss[loss=0.1862, simple_loss=0.2823, pruned_loss=0.04501, over 21739.00 frames. ], tot_loss[loss=0.2182, simple_loss=0.2959, pruned_loss=0.0703, over 4278573.78 frames. ], batch size: 351, lr: 3.82e-03, grad_scale: 16.0 2023-06-25 10:04:09,885 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1310532.0, ans=0.125 2023-06-25 10:04:13,200 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=1310532.0, ans=0.09899494936611666 2023-06-25 10:04:16,463 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1310532.0, ans=0.0 2023-06-25 10:04:16,527 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1310532.0, ans=0.1 2023-06-25 10:04:44,994 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1310652.0, ans=0.125 2023-06-25 10:05:00,801 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.225e+02 3.071e+02 4.183e+02 5.786e+02 1.763e+03, threshold=8.366e+02, percent-clipped=8.0 2023-06-25 10:05:37,414 INFO [train.py:996] (3/4) Epoch 8, batch 5000, loss[loss=0.2369, simple_loss=0.3114, pruned_loss=0.08125, over 21841.00 frames. ], tot_loss[loss=0.2149, simple_loss=0.2952, pruned_loss=0.06732, over 4276893.38 frames. ], batch size: 371, lr: 3.82e-03, grad_scale: 16.0 2023-06-25 10:05:38,037 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1310772.0, ans=0.125 2023-06-25 10:05:41,542 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1310772.0, ans=0.125 2023-06-25 10:06:14,824 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.54 vs. limit=12.0 2023-06-25 10:06:41,454 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1310952.0, ans=0.2 2023-06-25 10:06:53,994 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1310952.0, ans=0.1 2023-06-25 10:07:19,128 INFO [train.py:996] (3/4) Epoch 8, batch 5050, loss[loss=0.2123, simple_loss=0.2916, pruned_loss=0.06652, over 21567.00 frames. ], tot_loss[loss=0.2159, simple_loss=0.2951, pruned_loss=0.06836, over 4277839.77 frames. 
], batch size: 195, lr: 3.82e-03, grad_scale: 16.0 2023-06-25 10:07:50,979 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1311132.0, ans=0.125 2023-06-25 10:08:24,440 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1311252.0, ans=0.125 2023-06-25 10:08:25,857 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1311252.0, ans=0.0 2023-06-25 10:08:30,088 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.462e+02 3.598e+02 4.329e+02 6.155e+02 1.761e+03, threshold=8.658e+02, percent-clipped=10.0 2023-06-25 10:08:36,809 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.74 vs. limit=15.0 2023-06-25 10:08:43,255 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1311312.0, ans=0.0 2023-06-25 10:08:43,313 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=1311312.0, ans=0.125 2023-06-25 10:09:07,151 INFO [train.py:996] (3/4) Epoch 8, batch 5100, loss[loss=0.195, simple_loss=0.2655, pruned_loss=0.06231, over 21835.00 frames. ], tot_loss[loss=0.216, simple_loss=0.2935, pruned_loss=0.06931, over 4280860.48 frames. ], batch size: 282, lr: 3.81e-03, grad_scale: 16.0 2023-06-25 10:09:42,281 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1311432.0, ans=0.1 2023-06-25 10:10:52,983 INFO [train.py:996] (3/4) Epoch 8, batch 5150, loss[loss=0.218, simple_loss=0.2851, pruned_loss=0.07541, over 21597.00 frames. ], tot_loss[loss=0.2154, simple_loss=0.2921, pruned_loss=0.06933, over 4278208.67 frames. ], batch size: 263, lr: 3.81e-03, grad_scale: 16.0 2023-06-25 10:11:09,915 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=1311672.0, ans=0.125 2023-06-25 10:11:15,454 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.00 vs. limit=6.0 2023-06-25 10:11:27,301 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=1311732.0, ans=0.025 2023-06-25 10:11:55,260 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1311792.0, ans=0.0 2023-06-25 10:12:11,329 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.655e+02 3.617e+02 5.481e+02 7.313e+02 1.650e+03, threshold=1.096e+03, percent-clipped=16.0 2023-06-25 10:12:22,491 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1311852.0, ans=0.125 2023-06-25 10:12:48,514 INFO [train.py:996] (3/4) Epoch 8, batch 5200, loss[loss=0.2698, simple_loss=0.3694, pruned_loss=0.08508, over 21230.00 frames. ], tot_loss[loss=0.2191, simple_loss=0.2968, pruned_loss=0.0707, over 4273764.91 frames. 
], batch size: 548, lr: 3.81e-03, grad_scale: 32.0 2023-06-25 10:13:15,774 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=1312032.0, ans=0.125 2023-06-25 10:13:15,826 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1312032.0, ans=0.125 2023-06-25 10:14:43,356 INFO [train.py:996] (3/4) Epoch 8, batch 5250, loss[loss=0.195, simple_loss=0.2749, pruned_loss=0.05758, over 21770.00 frames. ], tot_loss[loss=0.2194, simple_loss=0.3007, pruned_loss=0.06901, over 4268786.24 frames. ], batch size: 112, lr: 3.81e-03, grad_scale: 32.0 2023-06-25 10:14:50,360 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=1312272.0, ans=0.125 2023-06-25 10:15:30,446 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1312392.0, ans=0.1 2023-06-25 10:15:53,837 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.659e+02 3.587e+02 4.772e+02 6.547e+02 1.598e+03, threshold=9.543e+02, percent-clipped=4.0 2023-06-25 10:15:59,681 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1312452.0, ans=0.125 2023-06-25 10:16:29,953 INFO [train.py:996] (3/4) Epoch 8, batch 5300, loss[loss=0.2163, simple_loss=0.2843, pruned_loss=0.07417, over 21862.00 frames. ], tot_loss[loss=0.2194, simple_loss=0.2994, pruned_loss=0.06966, over 4275865.93 frames. ], batch size: 282, lr: 3.81e-03, grad_scale: 32.0 2023-06-25 10:16:40,623 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=1312572.0, ans=0.07 2023-06-25 10:17:06,491 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=1312632.0, ans=0.07 2023-06-25 10:17:32,491 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=1312752.0, ans=0.0 2023-06-25 10:18:16,995 INFO [train.py:996] (3/4) Epoch 8, batch 5350, loss[loss=0.2073, simple_loss=0.2756, pruned_loss=0.06952, over 21821.00 frames. ], tot_loss[loss=0.2209, simple_loss=0.2987, pruned_loss=0.07151, over 4276977.10 frames. 
], batch size: 247, lr: 3.81e-03, grad_scale: 32.0 2023-06-25 10:18:21,434 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-25 10:19:07,842 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1312992.0, ans=0.1 2023-06-25 10:19:28,376 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.684e+02 3.542e+02 4.424e+02 5.994e+02 1.106e+03, threshold=8.848e+02, percent-clipped=4.0 2023-06-25 10:19:29,196 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1313052.0, ans=0.125 2023-06-25 10:19:39,207 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1313112.0, ans=0.0 2023-06-25 10:19:41,210 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-25 10:19:50,974 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten.whitening_limit, batch_count=1313112.0, ans=15.0 2023-06-25 10:20:05,535 INFO [train.py:996] (3/4) Epoch 8, batch 5400, loss[loss=0.2572, simple_loss=0.396, pruned_loss=0.0592, over 19740.00 frames. ], tot_loss[loss=0.222, simple_loss=0.2982, pruned_loss=0.07293, over 4277633.82 frames. ], batch size: 702, lr: 3.81e-03, grad_scale: 32.0 2023-06-25 10:20:48,118 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1313292.0, ans=0.125 2023-06-25 10:21:32,836 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=1313412.0, ans=0.2 2023-06-25 10:21:55,102 INFO [train.py:996] (3/4) Epoch 8, batch 5450, loss[loss=0.2099, simple_loss=0.2967, pruned_loss=0.06156, over 21383.00 frames. ], tot_loss[loss=0.2206, simple_loss=0.2992, pruned_loss=0.07103, over 4275980.45 frames. ], batch size: 131, lr: 3.81e-03, grad_scale: 16.0 2023-06-25 10:22:27,439 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1313532.0, ans=0.1 2023-06-25 10:22:56,111 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1313592.0, ans=0.125 2023-06-25 10:23:15,412 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.599e+02 4.381e+02 6.345e+02 1.127e+03 2.400e+03, threshold=1.269e+03, percent-clipped=34.0 2023-06-25 10:23:39,251 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1313712.0, ans=0.125 2023-06-25 10:23:45,621 INFO [train.py:996] (3/4) Epoch 8, batch 5500, loss[loss=0.1795, simple_loss=0.274, pruned_loss=0.04247, over 21667.00 frames. ], tot_loss[loss=0.2198, simple_loss=0.3022, pruned_loss=0.06872, over 4274149.42 frames. 
], batch size: 247, lr: 3.81e-03, grad_scale: 8.0 2023-06-25 10:24:02,494 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1313832.0, ans=0.125 2023-06-25 10:24:04,074 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=1313832.0, ans=0.125 2023-06-25 10:24:14,761 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=1313832.0, ans=0.025 2023-06-25 10:24:57,056 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=1313952.0, ans=0.125 2023-06-25 10:25:35,622 INFO [train.py:996] (3/4) Epoch 8, batch 5550, loss[loss=0.1871, simple_loss=0.2809, pruned_loss=0.04662, over 21672.00 frames. ], tot_loss[loss=0.2173, simple_loss=0.3023, pruned_loss=0.06615, over 4271243.62 frames. ], batch size: 247, lr: 3.81e-03, grad_scale: 8.0 2023-06-25 10:25:49,272 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=1314072.0, ans=0.0 2023-06-25 10:26:31,435 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=3.96 vs. limit=15.0 2023-06-25 10:27:03,714 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.281e+02 3.121e+02 4.354e+02 6.729e+02 1.471e+03, threshold=8.708e+02, percent-clipped=1.0 2023-06-25 10:27:24,342 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.39 vs. limit=22.5 2023-06-25 10:27:26,970 INFO [train.py:996] (3/4) Epoch 8, batch 5600, loss[loss=0.2133, simple_loss=0.2946, pruned_loss=0.06598, over 21057.00 frames. ], tot_loss[loss=0.2138, simple_loss=0.3007, pruned_loss=0.06351, over 4277111.13 frames. ], batch size: 143, lr: 3.81e-03, grad_scale: 16.0 2023-06-25 10:28:02,790 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-25 10:28:27,943 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=1314492.0, ans=0.125 2023-06-25 10:28:41,873 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.14 vs. limit=22.5 2023-06-25 10:28:44,980 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1314552.0, ans=0.125 2023-06-25 10:29:15,040 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.33 vs. limit=15.0 2023-06-25 10:29:15,396 INFO [train.py:996] (3/4) Epoch 8, batch 5650, loss[loss=0.2815, simple_loss=0.3375, pruned_loss=0.1127, over 21688.00 frames. ], tot_loss[loss=0.2186, simple_loss=0.3043, pruned_loss=0.06642, over 4284427.43 frames. ], batch size: 507, lr: 3.81e-03, grad_scale: 16.0 2023-06-25 10:29:39,647 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.08 vs. 
limit=10.0 2023-06-25 10:29:48,035 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1314732.0, ans=0.125 2023-06-25 10:30:35,082 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=15.47 vs. limit=22.5 2023-06-25 10:30:42,456 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.720e+02 4.225e+02 5.470e+02 8.803e+02 1.575e+03, threshold=1.094e+03, percent-clipped=25.0 2023-06-25 10:30:48,971 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1314912.0, ans=0.0 2023-06-25 10:30:49,049 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1314912.0, ans=0.1 2023-06-25 10:31:12,019 INFO [train.py:996] (3/4) Epoch 8, batch 5700, loss[loss=0.2053, simple_loss=0.2685, pruned_loss=0.07107, over 21275.00 frames. ], tot_loss[loss=0.2193, simple_loss=0.3027, pruned_loss=0.06798, over 4279641.47 frames. ], batch size: 608, lr: 3.81e-03, grad_scale: 16.0 2023-06-25 10:31:57,417 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-25 10:32:53,840 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1315212.0, ans=0.125 2023-06-25 10:33:14,814 INFO [train.py:996] (3/4) Epoch 8, batch 5750, loss[loss=0.1758, simple_loss=0.2728, pruned_loss=0.03944, over 21638.00 frames. ], tot_loss[loss=0.2154, simple_loss=0.2993, pruned_loss=0.06578, over 4272610.85 frames. ], batch size: 247, lr: 3.81e-03, grad_scale: 16.0 2023-06-25 10:33:33,198 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1315272.0, ans=0.125 2023-06-25 10:33:45,850 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=4.17 vs. limit=15.0 2023-06-25 10:33:48,760 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=1315332.0, ans=0.2 2023-06-25 10:33:48,808 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1315332.0, ans=0.0 2023-06-25 10:34:12,524 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=1315452.0, ans=0.0 2023-06-25 10:34:12,552 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1315452.0, ans=0.0 2023-06-25 10:34:31,279 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.460e+02 3.956e+02 5.585e+02 8.690e+02 2.193e+03, threshold=1.117e+03, percent-clipped=12.0 2023-06-25 10:34:51,810 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1.whitening_limit, batch_count=1315512.0, ans=10.0 2023-06-25 10:35:03,562 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1315572.0, ans=0.125 2023-06-25 10:35:05,088 INFO [train.py:996] (3/4) Epoch 8, batch 5800, loss[loss=0.2266, simple_loss=0.3266, pruned_loss=0.06336, over 21767.00 frames. 
], tot_loss[loss=0.2121, simple_loss=0.2966, pruned_loss=0.0638, over 4262440.16 frames. ], batch size: 332, lr: 3.81e-03, grad_scale: 16.0 2023-06-25 10:35:46,004 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer_na.min_abs, batch_count=1315692.0, ans=0.02 2023-06-25 10:36:36,221 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=1315812.0, ans=0.125 2023-06-25 10:36:36,791 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.58 vs. limit=12.0 2023-06-25 10:36:48,632 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=1315812.0, ans=0.05 2023-06-25 10:36:55,265 INFO [train.py:996] (3/4) Epoch 8, batch 5850, loss[loss=0.1672, simple_loss=0.2625, pruned_loss=0.03599, over 21388.00 frames. ], tot_loss[loss=0.2082, simple_loss=0.2954, pruned_loss=0.06052, over 4271164.94 frames. ], batch size: 211, lr: 3.81e-03, grad_scale: 16.0 2023-06-25 10:37:25,383 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=1315932.0, ans=0.2 2023-06-25 10:37:30,517 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1315932.0, ans=0.1 2023-06-25 10:38:21,046 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.019e+02 3.016e+02 4.169e+02 5.558e+02 1.178e+03, threshold=8.338e+02, percent-clipped=1.0 2023-06-25 10:38:43,378 INFO [train.py:996] (3/4) Epoch 8, batch 5900, loss[loss=0.1969, simple_loss=0.2725, pruned_loss=0.06064, over 21279.00 frames. ], tot_loss[loss=0.1994, simple_loss=0.288, pruned_loss=0.05537, over 4277579.75 frames. ], batch size: 159, lr: 3.81e-03, grad_scale: 16.0 2023-06-25 10:38:54,940 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.71 vs. limit=15.0 2023-06-25 10:40:19,619 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=1316412.0, ans=0.125 2023-06-25 10:40:36,583 INFO [train.py:996] (3/4) Epoch 8, batch 5950, loss[loss=0.2172, simple_loss=0.28, pruned_loss=0.07716, over 21833.00 frames. ], tot_loss[loss=0.2013, simple_loss=0.2858, pruned_loss=0.0584, over 4273218.87 frames. ], batch size: 98, lr: 3.81e-03, grad_scale: 16.0 2023-06-25 10:40:38,811 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1316472.0, ans=0.1 2023-06-25 10:41:48,797 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=1316652.0, ans=0.125 2023-06-25 10:41:57,151 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.419e+02 3.705e+02 4.644e+02 6.015e+02 1.261e+03, threshold=9.288e+02, percent-clipped=6.0 2023-06-25 10:42:09,853 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=1316712.0, ans=0.125 2023-06-25 10:42:24,682 INFO [train.py:996] (3/4) Epoch 8, batch 6000, loss[loss=0.2339, simple_loss=0.2851, pruned_loss=0.09135, over 21393.00 frames. ], tot_loss[loss=0.2028, simple_loss=0.2823, pruned_loss=0.06166, over 4262758.68 frames. 
], batch size: 473, lr: 3.81e-03, grad_scale: 32.0 2023-06-25 10:42:24,682 INFO [train.py:1019] (3/4) Computing validation loss 2023-06-25 10:42:43,102 INFO [train.py:1028] (3/4) Epoch 8, validation: loss=0.2599, simple_loss=0.3542, pruned_loss=0.08283, over 1796401.00 frames. 2023-06-25 10:42:43,103 INFO [train.py:1029] (3/4) Maximum memory allocated so far is 23654MB 2023-06-25 10:43:12,177 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1316832.0, ans=0.125 2023-06-25 10:43:17,529 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=1316832.0, ans=0.125 2023-06-25 10:44:31,079 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1317072.0, ans=0.125 2023-06-25 10:44:32,088 INFO [train.py:996] (3/4) Epoch 8, batch 6050, loss[loss=0.158, simple_loss=0.2377, pruned_loss=0.03911, over 21454.00 frames. ], tot_loss[loss=0.2013, simple_loss=0.2772, pruned_loss=0.06268, over 4267912.71 frames. ], batch size: 195, lr: 3.81e-03, grad_scale: 16.0 2023-06-25 10:44:40,019 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=8.58 vs. limit=15.0 2023-06-25 10:45:12,453 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1317132.0, ans=0.1 2023-06-25 10:45:28,240 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=1317192.0, ans=0.0 2023-06-25 10:46:02,684 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.258e+02 3.027e+02 3.543e+02 4.966e+02 9.624e+02, threshold=7.086e+02, percent-clipped=3.0 2023-06-25 10:46:04,848 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=1317312.0, ans=0.0 2023-06-25 10:46:21,235 INFO [train.py:996] (3/4) Epoch 8, batch 6100, loss[loss=0.2226, simple_loss=0.3079, pruned_loss=0.06865, over 21507.00 frames. ], tot_loss[loss=0.1993, simple_loss=0.2761, pruned_loss=0.06125, over 4273716.32 frames. ], batch size: 471, lr: 3.81e-03, grad_scale: 8.0 2023-06-25 10:47:23,106 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=1317492.0, ans=0.025 2023-06-25 10:47:23,609 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.25 vs. limit=15.0 2023-06-25 10:47:26,706 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1317492.0, ans=0.125 2023-06-25 10:47:40,114 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=1317552.0, ans=0.2 2023-06-25 10:48:09,196 INFO [train.py:996] (3/4) Epoch 8, batch 6150, loss[loss=0.2084, simple_loss=0.2809, pruned_loss=0.06797, over 21867.00 frames. ], tot_loss[loss=0.2035, simple_loss=0.2797, pruned_loss=0.06364, over 4280407.32 frames. 
], batch size: 98, lr: 3.81e-03, grad_scale: 8.0 2023-06-25 10:48:18,842 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=1317672.0, ans=0.04949747468305833 2023-06-25 10:48:38,722 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.10 vs. limit=6.0 2023-06-25 10:49:33,812 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1317852.0, ans=0.0 2023-06-25 10:49:38,027 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.673e+02 3.233e+02 3.904e+02 5.485e+02 1.131e+03, threshold=7.808e+02, percent-clipped=12.0 2023-06-25 10:49:58,256 INFO [train.py:996] (3/4) Epoch 8, batch 6200, loss[loss=0.2168, simple_loss=0.2925, pruned_loss=0.07053, over 21524.00 frames. ], tot_loss[loss=0.207, simple_loss=0.285, pruned_loss=0.0645, over 4285817.46 frames. ], batch size: 195, lr: 3.81e-03, grad_scale: 8.0 2023-06-25 10:50:18,526 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=1317972.0, ans=0.0 2023-06-25 10:50:58,064 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1318092.0, ans=0.1 2023-06-25 10:51:39,313 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1318212.0, ans=0.1 2023-06-25 10:51:39,321 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-25 10:51:39,877 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.43 vs. limit=15.0 2023-06-25 10:51:49,427 INFO [train.py:996] (3/4) Epoch 8, batch 6250, loss[loss=0.1908, simple_loss=0.2868, pruned_loss=0.04741, over 21391.00 frames. ], tot_loss[loss=0.2095, simple_loss=0.2907, pruned_loss=0.06413, over 4285731.39 frames. ], batch size: 211, lr: 3.80e-03, grad_scale: 8.0 2023-06-25 10:52:24,231 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1318332.0, ans=0.125 2023-06-25 10:52:54,362 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=1318392.0, ans=0.125 2023-06-25 10:52:59,180 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1318392.0, ans=0.125 2023-06-25 10:53:13,248 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=1318452.0, ans=0.04949747468305833 2023-06-25 10:53:17,964 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.870e+02 4.537e+02 6.426e+02 9.551e+02 1.693e+03, threshold=1.285e+03, percent-clipped=41.0 2023-06-25 10:53:23,647 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=1318512.0, ans=0.07 2023-06-25 10:53:28,835 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1318512.0, ans=0.125 2023-06-25 10:53:35,079 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=4.98 vs. 
limit=15.0 2023-06-25 10:53:42,658 INFO [train.py:996] (3/4) Epoch 8, batch 6300, loss[loss=0.2059, simple_loss=0.2857, pruned_loss=0.06301, over 21594.00 frames. ], tot_loss[loss=0.2099, simple_loss=0.2937, pruned_loss=0.06308, over 4278224.68 frames. ], batch size: 212, lr: 3.80e-03, grad_scale: 8.0 2023-06-25 10:54:22,120 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.84 vs. limit=15.0 2023-06-25 10:54:25,136 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=1318632.0, ans=0.2 2023-06-25 10:54:42,310 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1318692.0, ans=0.0 2023-06-25 10:55:42,019 INFO [train.py:996] (3/4) Epoch 8, batch 6350, loss[loss=0.2622, simple_loss=0.3345, pruned_loss=0.09496, over 21558.00 frames. ], tot_loss[loss=0.2157, simple_loss=0.2964, pruned_loss=0.06754, over 4284951.93 frames. ], batch size: 414, lr: 3.80e-03, grad_scale: 8.0 2023-06-25 10:56:34,215 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=1318992.0, ans=0.2 2023-06-25 10:56:41,292 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=1319052.0, ans=0.2 2023-06-25 10:57:02,229 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.857e+02 3.830e+02 4.751e+02 5.817e+02 1.226e+03, threshold=9.501e+02, percent-clipped=0.0 2023-06-25 10:57:04,757 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=1319112.0, ans=0.0 2023-06-25 10:57:17,239 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=1319112.0, ans=0.0 2023-06-25 10:57:27,552 INFO [train.py:996] (3/4) Epoch 8, batch 6400, loss[loss=0.2333, simple_loss=0.3112, pruned_loss=0.07773, over 21960.00 frames. ], tot_loss[loss=0.2235, simple_loss=0.3027, pruned_loss=0.07213, over 4284888.40 frames. ], batch size: 372, lr: 3.80e-03, grad_scale: 16.0 2023-06-25 10:57:50,896 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=1319232.0, ans=0.0 2023-06-25 10:58:48,253 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1319352.0, ans=0.1 2023-06-25 10:59:17,474 INFO [train.py:996] (3/4) Epoch 8, batch 6450, loss[loss=0.19, simple_loss=0.2773, pruned_loss=0.05137, over 21736.00 frames. ], tot_loss[loss=0.2252, simple_loss=0.306, pruned_loss=0.07221, over 4284680.14 frames. ], batch size: 282, lr: 3.80e-03, grad_scale: 16.0 2023-06-25 10:59:49,827 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.93 vs. 
limit=22.5 2023-06-25 11:00:35,766 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=1319652.0, ans=0.2 2023-06-25 11:00:42,535 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.679e+02 3.948e+02 4.858e+02 6.624e+02 1.248e+03, threshold=9.716e+02, percent-clipped=3.0 2023-06-25 11:00:53,221 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1319712.0, ans=0.125 2023-06-25 11:01:05,071 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.64 vs. limit=15.0 2023-06-25 11:01:06,989 INFO [train.py:996] (3/4) Epoch 8, batch 6500, loss[loss=0.1966, simple_loss=0.2572, pruned_loss=0.068, over 21235.00 frames. ], tot_loss[loss=0.2214, simple_loss=0.3001, pruned_loss=0.07134, over 4285365.41 frames. ], batch size: 159, lr: 3.80e-03, grad_scale: 16.0 2023-06-25 11:01:16,718 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=1319772.0, ans=0.07 2023-06-25 11:01:53,102 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1319892.0, ans=0.0 2023-06-25 11:02:57,205 INFO [train.py:996] (3/4) Epoch 8, batch 6550, loss[loss=0.2003, simple_loss=0.2815, pruned_loss=0.05954, over 21818.00 frames. ], tot_loss[loss=0.2179, simple_loss=0.2985, pruned_loss=0.06865, over 4276456.69 frames. ], batch size: 282, lr: 3.80e-03, grad_scale: 16.0 2023-06-25 11:03:13,431 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.81 vs. limit=15.0 2023-06-25 11:03:46,205 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.99 vs. limit=15.0 2023-06-25 11:03:55,561 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1320192.0, ans=0.1 2023-06-25 11:04:22,988 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.651e+02 3.584e+02 5.538e+02 7.556e+02 1.701e+03, threshold=1.108e+03, percent-clipped=14.0 2023-06-25 11:04:45,070 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1320372.0, ans=0.1 2023-06-25 11:04:46,582 INFO [train.py:996] (3/4) Epoch 8, batch 6600, loss[loss=0.1856, simple_loss=0.2529, pruned_loss=0.0592, over 21802.00 frames. ], tot_loss[loss=0.2142, simple_loss=0.2923, pruned_loss=0.06804, over 4276343.23 frames. 
], batch size: 98, lr: 3.80e-03, grad_scale: 8.0 2023-06-25 11:04:57,536 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1320372.0, ans=0.1 2023-06-25 11:05:06,851 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1320432.0, ans=0.125 2023-06-25 11:05:48,358 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=1320492.0, ans=0.125 2023-06-25 11:05:51,955 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1320492.0, ans=0.125 2023-06-25 11:05:53,166 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1320492.0, ans=0.1 2023-06-25 11:06:00,014 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=1320552.0, ans=0.04949747468305833 2023-06-25 11:06:05,091 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1320552.0, ans=0.0 2023-06-25 11:06:36,617 INFO [train.py:996] (3/4) Epoch 8, batch 6650, loss[loss=0.1862, simple_loss=0.2376, pruned_loss=0.0674, over 20883.00 frames. ], tot_loss[loss=0.2082, simple_loss=0.2842, pruned_loss=0.06612, over 4274779.17 frames. ], batch size: 608, lr: 3.80e-03, grad_scale: 8.0 2023-06-25 11:08:03,782 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.500e+02 3.315e+02 4.377e+02 5.902e+02 1.210e+03, threshold=8.754e+02, percent-clipped=3.0 2023-06-25 11:08:26,382 INFO [train.py:996] (3/4) Epoch 8, batch 6700, loss[loss=0.2668, simple_loss=0.3221, pruned_loss=0.1057, over 21448.00 frames. ], tot_loss[loss=0.2065, simple_loss=0.2797, pruned_loss=0.06661, over 4273189.46 frames. ], batch size: 509, lr: 3.80e-03, grad_scale: 8.0 2023-06-25 11:09:13,406 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=2.97 vs. limit=12.0 2023-06-25 11:09:23,298 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1321092.0, ans=0.1 2023-06-25 11:09:40,098 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.79 vs. limit=22.5 2023-06-25 11:09:44,883 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1321152.0, ans=0.125 2023-06-25 11:09:44,952 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=1321152.0, ans=0.0 2023-06-25 11:10:04,217 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=5.82 vs. limit=15.0 2023-06-25 11:10:10,070 INFO [train.py:996] (3/4) Epoch 8, batch 6750, loss[loss=0.2109, simple_loss=0.2798, pruned_loss=0.07104, over 21818.00 frames. ], tot_loss[loss=0.2048, simple_loss=0.2771, pruned_loss=0.06627, over 4271026.31 frames. 
], batch size: 333, lr: 3.80e-03, grad_scale: 8.0 2023-06-25 11:10:57,822 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=1321392.0, ans=0.125 2023-06-25 11:11:35,906 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.658e+02 3.451e+02 4.455e+02 6.236e+02 1.487e+03, threshold=8.910e+02, percent-clipped=11.0 2023-06-25 11:11:58,608 INFO [train.py:996] (3/4) Epoch 8, batch 6800, loss[loss=0.2163, simple_loss=0.2781, pruned_loss=0.07728, over 21260.00 frames. ], tot_loss[loss=0.2074, simple_loss=0.2784, pruned_loss=0.06825, over 4281526.29 frames. ], batch size: 159, lr: 3.80e-03, grad_scale: 16.0 2023-06-25 11:13:14,421 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1321752.0, ans=0.0 2023-06-25 11:13:41,724 INFO [train.py:996] (3/4) Epoch 8, batch 6850, loss[loss=0.2076, simple_loss=0.2701, pruned_loss=0.0725, over 21770.00 frames. ], tot_loss[loss=0.2082, simple_loss=0.2791, pruned_loss=0.06864, over 4276957.38 frames. ], batch size: 300, lr: 3.80e-03, grad_scale: 16.0 2023-06-25 11:14:05,033 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=1321932.0, ans=0.0 2023-06-25 11:14:11,482 INFO [scaling.py:962] (3/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=5.24 vs. limit=8.0 2023-06-25 11:14:37,465 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1321992.0, ans=0.125 2023-06-25 11:14:41,075 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=1321992.0, ans=0.125 2023-06-25 11:14:57,428 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=1322052.0, ans=0.2 2023-06-25 11:15:09,318 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.681e+02 3.753e+02 5.063e+02 7.364e+02 1.523e+03, threshold=1.013e+03, percent-clipped=16.0 2023-06-25 11:15:12,486 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.87 vs. limit=15.0 2023-06-25 11:15:13,544 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1322112.0, ans=0.0 2023-06-25 11:15:31,303 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1322172.0, ans=0.125 2023-06-25 11:15:32,221 INFO [train.py:996] (3/4) Epoch 8, batch 6900, loss[loss=0.1959, simple_loss=0.2626, pruned_loss=0.06458, over 21820.00 frames. ], tot_loss[loss=0.2092, simple_loss=0.2812, pruned_loss=0.06856, over 4282790.46 frames. ], batch size: 247, lr: 3.80e-03, grad_scale: 16.0 2023-06-25 11:15:38,219 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.90 vs. 
limit=10.0 2023-06-25 11:15:40,019 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1322172.0, ans=0.0 2023-06-25 11:16:12,891 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1322292.0, ans=0.125 2023-06-25 11:16:13,348 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=9.63 vs. limit=15.0 2023-06-25 11:16:34,004 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1322292.0, ans=0.125 2023-06-25 11:17:12,461 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=8.58 vs. limit=15.0 2023-06-25 11:17:19,091 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1322412.0, ans=0.125 2023-06-25 11:17:23,583 INFO [train.py:996] (3/4) Epoch 8, batch 6950, loss[loss=0.2365, simple_loss=0.3133, pruned_loss=0.07983, over 21691.00 frames. ], tot_loss[loss=0.2086, simple_loss=0.2847, pruned_loss=0.06624, over 4287519.27 frames. ], batch size: 351, lr: 3.80e-03, grad_scale: 16.0 2023-06-25 11:18:21,906 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=1322592.0, ans=0.0 2023-06-25 11:18:31,436 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=5.15 vs. limit=12.0 2023-06-25 11:18:37,861 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=6.06 vs. limit=15.0 2023-06-25 11:18:54,433 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.354e+02 3.544e+02 4.966e+02 6.681e+02 1.694e+03, threshold=9.931e+02, percent-clipped=7.0 2023-06-25 11:19:12,234 INFO [train.py:996] (3/4) Epoch 8, batch 7000, loss[loss=0.1993, simple_loss=0.2641, pruned_loss=0.06725, over 21737.00 frames. ], tot_loss[loss=0.2102, simple_loss=0.2859, pruned_loss=0.06723, over 4294984.79 frames. ], batch size: 317, lr: 3.80e-03, grad_scale: 16.0 2023-06-25 11:19:41,239 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=1322832.0, ans=0.2 2023-06-25 11:20:01,051 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=1322892.0, ans=0.0 2023-06-25 11:20:02,664 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=1322892.0, ans=0.125 2023-06-25 11:20:10,275 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=1322892.0, ans=0.125 2023-06-25 11:20:55,832 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1323072.0, ans=0.125 2023-06-25 11:20:56,845 INFO [train.py:996] (3/4) Epoch 8, batch 7050, loss[loss=0.22, simple_loss=0.306, pruned_loss=0.06697, over 21608.00 frames. ], tot_loss[loss=0.209, simple_loss=0.2838, pruned_loss=0.06706, over 4279141.89 frames. 
], batch size: 414, lr: 3.80e-03, grad_scale: 16.0 2023-06-25 11:20:59,906 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.50 vs. limit=6.0 2023-06-25 11:21:23,678 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.46 vs. limit=15.0 2023-06-25 11:21:27,105 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.97 vs. limit=22.5 2023-06-25 11:21:57,525 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1323192.0, ans=0.125 2023-06-25 11:22:05,799 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=1323252.0, ans=0.0 2023-06-25 11:22:28,034 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer_ff2.min_abs, batch_count=1323252.0, ans=0.1 2023-06-25 11:22:30,811 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.752e+02 3.663e+02 4.659e+02 6.225e+02 9.950e+02, threshold=9.319e+02, percent-clipped=1.0 2023-06-25 11:22:36,926 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=1323312.0, ans=0.0 2023-06-25 11:22:48,500 INFO [train.py:996] (3/4) Epoch 8, batch 7100, loss[loss=0.22, simple_loss=0.3053, pruned_loss=0.06736, over 21694.00 frames. ], tot_loss[loss=0.2132, simple_loss=0.2891, pruned_loss=0.06859, over 4285759.63 frames. ], batch size: 351, lr: 3.80e-03, grad_scale: 16.0 2023-06-25 11:24:10,977 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1323552.0, ans=0.125 2023-06-25 11:24:14,680 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer_na.min_abs, batch_count=1323612.0, ans=0.02 2023-06-25 11:24:16,819 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten.whitening_limit, batch_count=1323612.0, ans=15.0 2023-06-25 11:24:44,533 INFO [train.py:996] (3/4) Epoch 8, batch 7150, loss[loss=0.2498, simple_loss=0.3165, pruned_loss=0.09155, over 21351.00 frames. ], tot_loss[loss=0.2114, simple_loss=0.2878, pruned_loss=0.06749, over 4284247.38 frames. 
], batch size: 549, lr: 3.80e-03, grad_scale: 16.0 2023-06-25 11:24:45,178 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=1323672.0, ans=0.0 2023-06-25 11:25:14,937 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1323732.0, ans=0.0 2023-06-25 11:25:53,836 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1323852.0, ans=0.125 2023-06-25 11:25:59,612 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=1323852.0, ans=0.0 2023-06-25 11:26:11,773 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.272e+02 3.580e+02 4.514e+02 6.175e+02 1.199e+03, threshold=9.027e+02, percent-clipped=4.0 2023-06-25 11:26:39,343 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=1323972.0, ans=0.2 2023-06-25 11:26:40,824 INFO [train.py:996] (3/4) Epoch 8, batch 7200, loss[loss=0.2245, simple_loss=0.2964, pruned_loss=0.07625, over 21835.00 frames. ], tot_loss[loss=0.2136, simple_loss=0.2895, pruned_loss=0.06881, over 4279172.21 frames. ], batch size: 98, lr: 3.80e-03, grad_scale: 32.0 2023-06-25 11:27:13,950 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=1324092.0, ans=0.0 2023-06-25 11:27:34,420 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=1324092.0, ans=0.0 2023-06-25 11:27:59,689 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1324212.0, ans=0.125 2023-06-25 11:28:13,787 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=1324212.0, ans=0.0 2023-06-25 11:28:27,987 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=1324272.0, ans=0.0 2023-06-25 11:28:28,907 INFO [train.py:996] (3/4) Epoch 8, batch 7250, loss[loss=0.2195, simple_loss=0.2721, pruned_loss=0.08343, over 21332.00 frames. ], tot_loss[loss=0.213, simple_loss=0.2866, pruned_loss=0.0697, over 4273221.79 frames. ], batch size: 473, lr: 3.80e-03, grad_scale: 16.0 2023-06-25 11:28:31,781 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.18 vs. 
limit=22.5 2023-06-25 11:28:36,036 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1324272.0, ans=0.125 2023-06-25 11:29:34,663 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1324452.0, ans=0.1 2023-06-25 11:29:34,688 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=1324452.0, ans=0.035 2023-06-25 11:29:48,440 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1324512.0, ans=0.125 2023-06-25 11:29:51,143 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.614e+02 3.605e+02 4.552e+02 6.343e+02 1.382e+03, threshold=9.103e+02, percent-clipped=6.0 2023-06-25 11:29:52,387 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=7.03 vs. limit=15.0 2023-06-25 11:30:17,007 INFO [train.py:996] (3/4) Epoch 8, batch 7300, loss[loss=0.1931, simple_loss=0.258, pruned_loss=0.06409, over 21519.00 frames. ], tot_loss[loss=0.2091, simple_loss=0.2811, pruned_loss=0.06855, over 4272706.53 frames. ], batch size: 391, lr: 3.80e-03, grad_scale: 16.0 2023-06-25 11:30:19,437 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=1324572.0, ans=0.125 2023-06-25 11:30:29,125 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.27 vs. limit=15.0 2023-06-25 11:30:35,198 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=1324632.0, ans=0.125 2023-06-25 11:31:01,954 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1324692.0, ans=0.0 2023-06-25 11:31:02,001 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=1324692.0, ans=0.025 2023-06-25 11:31:24,809 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.92 vs. limit=15.0 2023-06-25 11:31:26,160 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=1324752.0, ans=0.2 2023-06-25 11:32:07,770 INFO [train.py:996] (3/4) Epoch 8, batch 7350, loss[loss=0.2498, simple_loss=0.313, pruned_loss=0.09333, over 21319.00 frames. ], tot_loss[loss=0.2072, simple_loss=0.278, pruned_loss=0.06821, over 4264954.37 frames. ], batch size: 159, lr: 3.80e-03, grad_scale: 16.0 2023-06-25 11:32:38,953 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1324932.0, ans=0.125 2023-06-25 11:33:12,211 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1325052.0, ans=0.125 2023-06-25 11:33:14,578 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.34 vs. 
limit=15.0 2023-06-25 11:33:44,446 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.590e+02 4.058e+02 5.630e+02 9.164e+02 1.929e+03, threshold=1.126e+03, percent-clipped=26.0 2023-06-25 11:33:59,694 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1325172.0, ans=0.125 2023-06-25 11:34:01,233 INFO [train.py:996] (3/4) Epoch 8, batch 7400, loss[loss=0.1941, simple_loss=0.2492, pruned_loss=0.06951, over 20707.00 frames. ], tot_loss[loss=0.2118, simple_loss=0.2821, pruned_loss=0.07079, over 4264795.96 frames. ], batch size: 609, lr: 3.79e-03, grad_scale: 16.0 2023-06-25 11:34:08,699 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=1325172.0, ans=0.125 2023-06-25 11:34:51,329 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=9.08 vs. limit=15.0 2023-06-25 11:35:19,897 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=1325352.0, ans=0.2 2023-06-25 11:35:51,159 INFO [train.py:996] (3/4) Epoch 8, batch 7450, loss[loss=0.2276, simple_loss=0.2895, pruned_loss=0.08282, over 21634.00 frames. ], tot_loss[loss=0.2114, simple_loss=0.282, pruned_loss=0.0704, over 4272286.94 frames. ], batch size: 415, lr: 3.79e-03, grad_scale: 16.0 2023-06-25 11:35:55,188 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 11:36:08,356 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1325472.0, ans=0.1 2023-06-25 11:37:28,597 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.598e+02 3.413e+02 4.464e+02 6.199e+02 1.662e+03, threshold=8.927e+02, percent-clipped=2.0 2023-06-25 11:37:34,702 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1325712.0, ans=0.125 2023-06-25 11:37:47,594 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1325712.0, ans=0.0 2023-06-25 11:37:50,176 INFO [train.py:996] (3/4) Epoch 8, batch 7500, loss[loss=0.2225, simple_loss=0.307, pruned_loss=0.06904, over 21221.00 frames. ], tot_loss[loss=0.2143, simple_loss=0.2863, pruned_loss=0.07117, over 4274603.09 frames. ], batch size: 143, lr: 3.79e-03, grad_scale: 16.0 2023-06-25 11:38:22,048 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 11:38:22,094 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=1325832.0, ans=0.09899494936611666 2023-06-25 11:39:09,832 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=1325952.0, ans=0.125 2023-06-25 11:39:34,787 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 11:39:34,899 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=1326012.0, ans=0.125 2023-06-25 11:39:37,754 INFO [train.py:996] (3/4) Epoch 8, batch 7550, loss[loss=0.2348, simple_loss=0.3322, pruned_loss=0.0687, over 20026.00 frames. 
], tot_loss[loss=0.2172, simple_loss=0.2938, pruned_loss=0.0703, over 4260841.44 frames. ], batch size: 702, lr: 3.79e-03, grad_scale: 16.0 2023-06-25 11:39:38,207 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1326072.0, ans=0.125 2023-06-25 11:39:40,584 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=9.72 vs. limit=15.0 2023-06-25 11:40:08,585 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=2.97 vs. limit=6.0 2023-06-25 11:40:10,368 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1326132.0, ans=0.0 2023-06-25 11:40:45,472 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=1326252.0, ans=0.0 2023-06-25 11:41:05,653 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.417e+02 3.672e+02 5.210e+02 9.088e+02 2.173e+03, threshold=1.042e+03, percent-clipped=24.0 2023-06-25 11:41:26,455 INFO [train.py:996] (3/4) Epoch 8, batch 7600, loss[loss=0.2246, simple_loss=0.2872, pruned_loss=0.08099, over 21586.00 frames. ], tot_loss[loss=0.2166, simple_loss=0.2938, pruned_loss=0.06973, over 4271099.83 frames. ], batch size: 548, lr: 3.79e-03, grad_scale: 32.0 2023-06-25 11:41:32,282 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1326372.0, ans=0.1 2023-06-25 11:41:41,038 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.41 vs. limit=6.0 2023-06-25 11:42:01,533 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=4.46 vs. limit=15.0 2023-06-25 11:42:10,411 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1326492.0, ans=0.125 2023-06-25 11:42:36,617 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.min_positive, batch_count=1326552.0, ans=0.05 2023-06-25 11:43:09,687 INFO [train.py:996] (3/4) Epoch 8, batch 7650, loss[loss=0.2275, simple_loss=0.2947, pruned_loss=0.08016, over 21760.00 frames. ], tot_loss[loss=0.2174, simple_loss=0.2936, pruned_loss=0.07057, over 4281857.40 frames. ], batch size: 389, lr: 3.79e-03, grad_scale: 32.0 2023-06-25 11:43:27,504 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=1326732.0, ans=0.2 2023-06-25 11:43:33,102 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.28 vs. limit=15.0 2023-06-25 11:44:29,526 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=1326852.0, ans=0.125 2023-06-25 11:44:44,979 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.753e+02 3.604e+02 4.352e+02 5.552e+02 1.331e+03, threshold=8.705e+02, percent-clipped=4.0 2023-06-25 11:44:59,558 INFO [train.py:996] (3/4) Epoch 8, batch 7700, loss[loss=0.2577, simple_loss=0.3233, pruned_loss=0.09609, over 21408.00 frames. 
], tot_loss[loss=0.2206, simple_loss=0.2955, pruned_loss=0.07285, over 4282596.24 frames. ], batch size: 131, lr: 3.79e-03, grad_scale: 16.0 2023-06-25 11:46:27,879 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=1327212.0, ans=0.0 2023-06-25 11:46:40,160 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1327212.0, ans=0.1 2023-06-25 11:46:43,299 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1327212.0, ans=0.1 2023-06-25 11:46:46,217 INFO [train.py:996] (3/4) Epoch 8, batch 7750, loss[loss=0.2674, simple_loss=0.3727, pruned_loss=0.08106, over 21864.00 frames. ], tot_loss[loss=0.2233, simple_loss=0.3004, pruned_loss=0.07306, over 4282736.67 frames. ], batch size: 372, lr: 3.79e-03, grad_scale: 8.0 2023-06-25 11:47:41,753 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.40 vs. limit=15.0 2023-06-25 11:48:21,265 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1327512.0, ans=0.125 2023-06-25 11:48:24,882 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.807e+02 4.127e+02 5.917e+02 8.235e+02 1.345e+03, threshold=1.183e+03, percent-clipped=19.0 2023-06-25 11:48:27,141 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1327512.0, ans=0.0 2023-06-25 11:48:37,417 INFO [train.py:996] (3/4) Epoch 8, batch 7800, loss[loss=0.1834, simple_loss=0.2339, pruned_loss=0.06646, over 21703.00 frames. ], tot_loss[loss=0.2246, simple_loss=0.3024, pruned_loss=0.07343, over 4271911.68 frames. ], batch size: 124, lr: 3.79e-03, grad_scale: 8.0 2023-06-25 11:49:30,999 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1327692.0, ans=0.0 2023-06-25 11:49:36,211 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-25 11:50:07,691 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=1327812.0, ans=0.0 2023-06-25 11:50:15,482 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys.whitening_limit, batch_count=1327812.0, ans=6.0 2023-06-25 11:50:26,497 INFO [train.py:996] (3/4) Epoch 8, batch 7850, loss[loss=0.2055, simple_loss=0.2679, pruned_loss=0.07158, over 21539.00 frames. ], tot_loss[loss=0.2215, simple_loss=0.2975, pruned_loss=0.07277, over 4261970.38 frames. ], batch size: 391, lr: 3.79e-03, grad_scale: 8.0 2023-06-25 11:50:40,832 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1327872.0, ans=0.125 2023-06-25 11:51:11,993 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1327932.0, ans=0.125 2023-06-25 11:52:07,048 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.783e+02 3.555e+02 5.085e+02 7.464e+02 1.705e+03, threshold=1.017e+03, percent-clipped=5.0 2023-06-25 11:52:26,567 INFO [train.py:996] (3/4) Epoch 8, batch 7900, loss[loss=0.2101, simple_loss=0.3304, pruned_loss=0.04488, over 19825.00 frames. 
], tot_loss[loss=0.218, simple_loss=0.2926, pruned_loss=0.07169, over 4262924.07 frames. ], batch size: 702, lr: 3.79e-03, grad_scale: 8.0 2023-06-25 11:52:29,052 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1328172.0, ans=0.125 2023-06-25 11:52:32,531 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=1328172.0, ans=0.07 2023-06-25 11:53:12,174 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=4.24 vs. limit=15.0 2023-06-25 11:53:30,788 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1328292.0, ans=0.1 2023-06-25 11:54:00,284 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=6.52 vs. limit=22.5 2023-06-25 11:54:24,570 INFO [train.py:996] (3/4) Epoch 8, batch 7950, loss[loss=0.225, simple_loss=0.3082, pruned_loss=0.0709, over 21429.00 frames. ], tot_loss[loss=0.22, simple_loss=0.297, pruned_loss=0.07145, over 4267397.54 frames. ], batch size: 194, lr: 3.79e-03, grad_scale: 8.0 2023-06-25 11:55:06,461 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=1328532.0, ans=0.125 2023-06-25 11:55:08,584 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=1328532.0, ans=0.0 2023-06-25 11:55:10,663 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1328592.0, ans=0.1 2023-06-25 11:55:46,734 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=14.87 vs. limit=22.5 2023-06-25 11:56:11,365 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.729e+02 4.611e+02 6.417e+02 9.938e+02 3.239e+03, threshold=1.283e+03, percent-clipped=22.0 2023-06-25 11:56:24,105 INFO [train.py:996] (3/4) Epoch 8, batch 8000, loss[loss=0.1872, simple_loss=0.2332, pruned_loss=0.07056, over 20033.00 frames. ], tot_loss[loss=0.2243, simple_loss=0.3004, pruned_loss=0.07408, over 4269932.03 frames. ], batch size: 704, lr: 3.79e-03, grad_scale: 16.0 2023-06-25 11:56:43,469 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.52 vs. limit=15.0 2023-06-25 11:57:15,215 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer_ff2.min_abs, batch_count=1328892.0, ans=0.1 2023-06-25 11:57:31,471 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1328952.0, ans=0.125 2023-06-25 11:57:53,021 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.25 vs. 
limit=15.0 2023-06-25 11:58:05,270 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.min_abs, batch_count=1329012.0, ans=0.5 2023-06-25 11:58:07,779 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten.whitening_limit, batch_count=1329012.0, ans=15.0 2023-06-25 11:58:19,108 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1329012.0, ans=0.1 2023-06-25 11:58:24,047 INFO [train.py:996] (3/4) Epoch 8, batch 8050, loss[loss=0.2154, simple_loss=0.2949, pruned_loss=0.06796, over 21787.00 frames. ], tot_loss[loss=0.2267, simple_loss=0.3042, pruned_loss=0.07467, over 4271305.57 frames. ], batch size: 282, lr: 3.79e-03, grad_scale: 16.0 2023-06-25 11:58:43,431 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.01 vs. limit=10.0 2023-06-25 11:58:46,208 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1329132.0, ans=0.125 2023-06-25 11:59:55,823 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1329312.0, ans=0.125 2023-06-25 12:00:03,818 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.754e+02 4.648e+02 6.798e+02 1.163e+03 2.924e+03, threshold=1.360e+03, percent-clipped=20.0 2023-06-25 12:00:04,612 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1329312.0, ans=0.125 2023-06-25 12:00:16,731 INFO [train.py:996] (3/4) Epoch 8, batch 8100, loss[loss=0.2107, simple_loss=0.2837, pruned_loss=0.06881, over 21784.00 frames. ], tot_loss[loss=0.2267, simple_loss=0.3025, pruned_loss=0.07538, over 4277134.96 frames. ], batch size: 247, lr: 3.79e-03, grad_scale: 16.0 2023-06-25 12:01:14,970 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1329492.0, ans=0.125 2023-06-25 12:02:15,853 INFO [train.py:996] (3/4) Epoch 8, batch 8150, loss[loss=0.237, simple_loss=0.3346, pruned_loss=0.06965, over 21792.00 frames. ], tot_loss[loss=0.2327, simple_loss=0.3113, pruned_loss=0.07699, over 4277464.79 frames. 
], batch size: 352, lr: 3.79e-03, grad_scale: 16.0 2023-06-25 12:02:18,119 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 12:03:22,500 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1329792.0, ans=0.1 2023-06-25 12:03:28,856 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1329852.0, ans=0.1 2023-06-25 12:03:30,475 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=1329852.0, ans=0.2 2023-06-25 12:03:40,954 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1329912.0, ans=0.125 2023-06-25 12:03:47,514 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.902e+02 4.317e+02 6.289e+02 1.033e+03 2.172e+03, threshold=1.258e+03, percent-clipped=12.0 2023-06-25 12:03:57,205 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.08 vs. limit=6.0 2023-06-25 12:04:04,947 INFO [train.py:996] (3/4) Epoch 8, batch 8200, loss[loss=0.175, simple_loss=0.2364, pruned_loss=0.05679, over 21167.00 frames. ], tot_loss[loss=0.2262, simple_loss=0.3033, pruned_loss=0.0745, over 4272899.13 frames. ], batch size: 143, lr: 3.79e-03, grad_scale: 16.0 2023-06-25 12:05:54,967 INFO [train.py:996] (3/4) Epoch 8, batch 8250, loss[loss=0.1801, simple_loss=0.2687, pruned_loss=0.04579, over 21334.00 frames. ], tot_loss[loss=0.2246, simple_loss=0.3018, pruned_loss=0.07373, over 4278956.98 frames. ], batch size: 131, lr: 3.79e-03, grad_scale: 16.0 2023-06-25 12:06:22,788 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=1330332.0, ans=0.0 2023-06-25 12:06:31,352 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1330332.0, ans=0.0 2023-06-25 12:07:17,193 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1330452.0, ans=0.125 2023-06-25 12:07:27,235 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.445e+02 3.428e+02 4.253e+02 6.741e+02 1.234e+03, threshold=8.505e+02, percent-clipped=0.0 2023-06-25 12:07:50,413 INFO [train.py:996] (3/4) Epoch 8, batch 8300, loss[loss=0.2164, simple_loss=0.3081, pruned_loss=0.06236, over 21206.00 frames. ], tot_loss[loss=0.2207, simple_loss=0.2994, pruned_loss=0.071, over 4274428.23 frames. ], batch size: 548, lr: 3.79e-03, grad_scale: 16.0 2023-06-25 12:08:05,587 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1330572.0, ans=0.1 2023-06-25 12:08:50,807 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1330692.0, ans=0.125 2023-06-25 12:08:54,818 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.45 vs. limit=12.0 2023-06-25 12:08:54,909 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.61 vs. 
limit=15.0 2023-06-25 12:09:01,004 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=1330752.0, ans=0.07 2023-06-25 12:09:14,689 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1330812.0, ans=0.125 2023-06-25 12:09:38,995 INFO [train.py:996] (3/4) Epoch 8, batch 8350, loss[loss=0.2153, simple_loss=0.2941, pruned_loss=0.06819, over 21668.00 frames. ], tot_loss[loss=0.2181, simple_loss=0.2982, pruned_loss=0.06894, over 4273841.82 frames. ], batch size: 415, lr: 3.79e-03, grad_scale: 16.0 2023-06-25 12:09:57,594 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1330872.0, ans=0.125 2023-06-25 12:10:04,874 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=1330932.0, ans=0.0 2023-06-25 12:10:29,767 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1330992.0, ans=0.125 2023-06-25 12:10:34,629 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1330992.0, ans=0.125 2023-06-25 12:10:41,811 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1331052.0, ans=0.125 2023-06-25 12:10:53,560 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1331052.0, ans=0.0 2023-06-25 12:11:10,348 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.535e+02 3.482e+02 5.019e+02 7.188e+02 1.647e+03, threshold=1.004e+03, percent-clipped=15.0 2023-06-25 12:11:27,223 INFO [train.py:996] (3/4) Epoch 8, batch 8400, loss[loss=0.1968, simple_loss=0.2778, pruned_loss=0.05794, over 21403.00 frames. ], tot_loss[loss=0.2148, simple_loss=0.2961, pruned_loss=0.06673, over 4276310.57 frames. ], batch size: 194, lr: 3.79e-03, grad_scale: 32.0 2023-06-25 12:11:53,347 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.09 vs. limit=10.0 2023-06-25 12:12:09,822 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1331292.0, ans=0.0 2023-06-25 12:12:26,376 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.92 vs. limit=22.5 2023-06-25 12:12:46,819 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1331412.0, ans=0.125 2023-06-25 12:13:15,375 INFO [train.py:996] (3/4) Epoch 8, batch 8450, loss[loss=0.2185, simple_loss=0.2836, pruned_loss=0.07668, over 21735.00 frames. ], tot_loss[loss=0.2132, simple_loss=0.2929, pruned_loss=0.06673, over 4281924.89 frames. 
], batch size: 414, lr: 3.79e-03, grad_scale: 32.0 2023-06-25 12:13:21,269 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1331472.0, ans=0.125 2023-06-25 12:14:45,594 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.627e+02 3.841e+02 5.103e+02 7.112e+02 1.474e+03, threshold=1.021e+03, percent-clipped=11.0 2023-06-25 12:15:04,518 INFO [train.py:996] (3/4) Epoch 8, batch 8500, loss[loss=0.189, simple_loss=0.2521, pruned_loss=0.06298, over 21531.00 frames. ], tot_loss[loss=0.2122, simple_loss=0.289, pruned_loss=0.06764, over 4280047.01 frames. ], batch size: 230, lr: 3.79e-03, grad_scale: 32.0 2023-06-25 12:15:27,993 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1331832.0, ans=0.0 2023-06-25 12:16:00,791 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1331892.0, ans=0.125 2023-06-25 12:16:11,855 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=1331952.0, ans=0.0 2023-06-25 12:16:17,679 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.89 vs. limit=10.0 2023-06-25 12:16:44,630 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1332012.0, ans=0.125 2023-06-25 12:16:56,633 INFO [train.py:996] (3/4) Epoch 8, batch 8550, loss[loss=0.2692, simple_loss=0.3673, pruned_loss=0.08553, over 21250.00 frames. ], tot_loss[loss=0.2184, simple_loss=0.2955, pruned_loss=0.07063, over 4286494.78 frames. ], batch size: 548, lr: 3.78e-03, grad_scale: 16.0 2023-06-25 12:16:57,241 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1332072.0, ans=0.1 2023-06-25 12:17:27,527 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=1332132.0, ans=0.2 2023-06-25 12:17:51,955 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=1332192.0, ans=0.125 2023-06-25 12:17:54,353 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1332192.0, ans=0.1 2023-06-25 12:18:36,642 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.783e+02 4.136e+02 5.316e+02 7.631e+02 1.468e+03, threshold=1.063e+03, percent-clipped=11.0 2023-06-25 12:18:52,715 INFO [train.py:996] (3/4) Epoch 8, batch 8600, loss[loss=0.2903, simple_loss=0.3572, pruned_loss=0.1117, over 21309.00 frames. ], tot_loss[loss=0.2249, simple_loss=0.3027, pruned_loss=0.07351, over 4282047.12 frames. 
], batch size: 143, lr: 3.78e-03, grad_scale: 16.0 2023-06-25 12:19:28,990 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1332432.0, ans=0.125 2023-06-25 12:20:23,276 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1332612.0, ans=0.125 2023-06-25 12:20:26,541 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1332612.0, ans=0.1 2023-06-25 12:20:43,307 INFO [train.py:996] (3/4) Epoch 8, batch 8650, loss[loss=0.195, simple_loss=0.2802, pruned_loss=0.05496, over 21106.00 frames. ], tot_loss[loss=0.2279, simple_loss=0.3082, pruned_loss=0.07386, over 4285242.67 frames. ], batch size: 143, lr: 3.78e-03, grad_scale: 16.0 2023-06-25 12:21:29,200 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1332792.0, ans=0.1 2023-06-25 12:22:16,159 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.514e+02 3.872e+02 5.286e+02 7.583e+02 1.337e+03, threshold=1.057e+03, percent-clipped=5.0 2023-06-25 12:22:32,425 INFO [train.py:996] (3/4) Epoch 8, batch 8700, loss[loss=0.1908, simple_loss=0.2727, pruned_loss=0.05439, over 15281.00 frames. ], tot_loss[loss=0.2209, simple_loss=0.3005, pruned_loss=0.07063, over 4270533.60 frames. ], batch size: 61, lr: 3.78e-03, grad_scale: 16.0 2023-06-25 12:22:43,533 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1332972.0, ans=0.0 2023-06-25 12:22:46,706 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=1332972.0, ans=0.125 2023-06-25 12:23:03,062 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.11 vs. limit=15.0 2023-06-25 12:23:42,296 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1333152.0, ans=0.0 2023-06-25 12:24:10,184 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1333212.0, ans=0.0 2023-06-25 12:24:21,807 INFO [train.py:996] (3/4) Epoch 8, batch 8750, loss[loss=0.2264, simple_loss=0.2993, pruned_loss=0.07674, over 21878.00 frames. ], tot_loss[loss=0.2195, simple_loss=0.2971, pruned_loss=0.07097, over 4271979.01 frames. 
], batch size: 124, lr: 3.78e-03, grad_scale: 16.0 2023-06-25 12:24:22,425 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=1333272.0, ans=10.0 2023-06-25 12:24:45,480 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1333332.0, ans=0.125 2023-06-25 12:24:59,776 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=1333332.0, ans=0.025 2023-06-25 12:25:19,451 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=1333392.0, ans=0.0 2023-06-25 12:25:29,870 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=1333452.0, ans=0.125 2023-06-25 12:25:46,794 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=1333452.0, ans=0.125 2023-06-25 12:25:57,118 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1333512.0, ans=0.125 2023-06-25 12:26:02,039 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.799e+02 3.947e+02 5.629e+02 7.790e+02 1.713e+03, threshold=1.126e+03, percent-clipped=18.0 2023-06-25 12:26:18,051 INFO [train.py:996] (3/4) Epoch 8, batch 8800, loss[loss=0.3364, simple_loss=0.4007, pruned_loss=0.136, over 21443.00 frames. ], tot_loss[loss=0.2278, simple_loss=0.3079, pruned_loss=0.07388, over 4271096.98 frames. ], batch size: 507, lr: 3.78e-03, grad_scale: 32.0 2023-06-25 12:27:12,233 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.22 vs. limit=6.0 2023-06-25 12:27:22,178 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1333752.0, ans=0.0 2023-06-25 12:28:07,995 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer_na.min_abs, batch_count=1333872.0, ans=0.02 2023-06-25 12:28:08,113 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1333872.0, ans=0.125 2023-06-25 12:28:09,017 INFO [train.py:996] (3/4) Epoch 8, batch 8850, loss[loss=0.2186, simple_loss=0.3157, pruned_loss=0.0607, over 21295.00 frames. ], tot_loss[loss=0.2328, simple_loss=0.3141, pruned_loss=0.07579, over 4272777.62 frames. 
], batch size: 176, lr: 3.78e-03, grad_scale: 16.0 2023-06-25 12:28:43,579 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=1333932.0, ans=0.2 2023-06-25 12:28:43,641 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1333932.0, ans=0.125 2023-06-25 12:28:47,161 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1333932.0, ans=0.125 2023-06-25 12:28:49,119 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1333992.0, ans=0.125 2023-06-25 12:29:02,430 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1333992.0, ans=0.1 2023-06-25 12:29:07,565 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=1333992.0, ans=0.0 2023-06-25 12:29:09,228 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1333992.0, ans=0.125 2023-06-25 12:29:29,028 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=1334052.0, ans=0.125 2023-06-25 12:29:51,857 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.618e+02 3.568e+02 4.882e+02 6.738e+02 2.080e+03, threshold=9.764e+02, percent-clipped=3.0 2023-06-25 12:30:01,479 INFO [train.py:996] (3/4) Epoch 8, batch 8900, loss[loss=0.2134, simple_loss=0.2928, pruned_loss=0.06704, over 21754.00 frames. ], tot_loss[loss=0.2278, simple_loss=0.3071, pruned_loss=0.07426, over 4265993.98 frames. ], batch size: 351, lr: 3.78e-03, grad_scale: 16.0 2023-06-25 12:30:20,618 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=1334172.0, ans=0.2 2023-06-25 12:30:27,801 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.32 vs. limit=15.0 2023-06-25 12:30:53,568 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1334292.0, ans=0.0 2023-06-25 12:31:22,832 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=24.38 vs. limit=22.5 2023-06-25 12:31:40,640 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.90 vs. limit=6.0 2023-06-25 12:31:58,154 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=1334472.0, ans=0.5 2023-06-25 12:31:59,201 INFO [train.py:996] (3/4) Epoch 8, batch 8950, loss[loss=0.2132, simple_loss=0.2785, pruned_loss=0.07392, over 21596.00 frames. ], tot_loss[loss=0.227, simple_loss=0.3075, pruned_loss=0.07323, over 4267478.88 frames. 
], batch size: 263, lr: 3.78e-03, grad_scale: 16.0 2023-06-25 12:32:09,027 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=1334472.0, ans=0.0 2023-06-25 12:32:27,386 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1334532.0, ans=0.125 2023-06-25 12:32:35,280 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1334532.0, ans=0.125 2023-06-25 12:32:41,030 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=1334592.0, ans=0.125 2023-06-25 12:32:47,272 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1334592.0, ans=0.0 2023-06-25 12:33:13,098 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1334652.0, ans=0.125 2023-06-25 12:33:34,160 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.718e+02 4.076e+02 6.080e+02 7.762e+02 1.933e+03, threshold=1.216e+03, percent-clipped=14.0 2023-06-25 12:33:48,737 INFO [train.py:996] (3/4) Epoch 8, batch 9000, loss[loss=0.2127, simple_loss=0.2738, pruned_loss=0.0758, over 21617.00 frames. ], tot_loss[loss=0.2228, simple_loss=0.3008, pruned_loss=0.07245, over 4270993.52 frames. ], batch size: 332, lr: 3.78e-03, grad_scale: 16.0 2023-06-25 12:33:48,738 INFO [train.py:1019] (3/4) Computing validation loss 2023-06-25 12:34:07,159 INFO [train.py:1028] (3/4) Epoch 8, validation: loss=0.2631, simple_loss=0.3554, pruned_loss=0.08544, over 1796401.00 frames. 2023-06-25 12:34:07,160 INFO [train.py:1029] (3/4) Maximum memory allocated so far is 23654MB 2023-06-25 12:34:13,414 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1334772.0, ans=0.125 2023-06-25 12:35:32,219 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1334952.0, ans=0.125 2023-06-25 12:35:56,498 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1335072.0, ans=0.125 2023-06-25 12:35:57,424 INFO [train.py:996] (3/4) Epoch 8, batch 9050, loss[loss=0.2024, simple_loss=0.2843, pruned_loss=0.06029, over 21331.00 frames. ], tot_loss[loss=0.2173, simple_loss=0.2967, pruned_loss=0.06893, over 4270018.30 frames. ], batch size: 211, lr: 3.78e-03, grad_scale: 16.0 2023-06-25 12:36:09,348 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=1335072.0, ans=0.2 2023-06-25 12:36:09,389 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=1335072.0, ans=0.125 2023-06-25 12:36:36,729 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=1335132.0, ans=0.0 2023-06-25 12:37:17,590 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=6.52 vs. 
limit=15.0 2023-06-25 12:37:31,520 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1335312.0, ans=0.125 2023-06-25 12:37:46,040 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=1335312.0, ans=0.2 2023-06-25 12:37:47,098 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.667e+02 3.976e+02 5.366e+02 7.574e+02 1.688e+03, threshold=1.073e+03, percent-clipped=5.0 2023-06-25 12:37:47,601 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1335312.0, ans=0.125 2023-06-25 12:37:55,892 INFO [train.py:996] (3/4) Epoch 8, batch 9100, loss[loss=0.21, simple_loss=0.3056, pruned_loss=0.05717, over 21586.00 frames. ], tot_loss[loss=0.2215, simple_loss=0.3021, pruned_loss=0.07048, over 4273397.74 frames. ], batch size: 230, lr: 3.78e-03, grad_scale: 16.0 2023-06-25 12:38:16,530 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.90 vs. limit=22.5 2023-06-25 12:38:38,502 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1335432.0, ans=0.0 2023-06-25 12:38:42,533 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1335492.0, ans=0.0 2023-06-25 12:39:29,817 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 12:39:44,125 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 12:39:47,136 INFO [train.py:996] (3/4) Epoch 8, batch 9150, loss[loss=0.2711, simple_loss=0.3661, pruned_loss=0.08803, over 21578.00 frames. ], tot_loss[loss=0.2226, simple_loss=0.3072, pruned_loss=0.06904, over 4271472.73 frames. ], batch size: 471, lr: 3.78e-03, grad_scale: 16.0 2023-06-25 12:40:46,338 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=9.61 vs. limit=15.0 2023-06-25 12:40:55,251 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=1335852.0, ans=0.04949747468305833 2023-06-25 12:41:27,971 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.600e+02 3.582e+02 4.285e+02 5.759e+02 1.145e+03, threshold=8.570e+02, percent-clipped=4.0 2023-06-25 12:41:28,830 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=1335912.0, ans=0.0 2023-06-25 12:41:46,071 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1335972.0, ans=0.0 2023-06-25 12:41:47,514 INFO [train.py:996] (3/4) Epoch 8, batch 9200, loss[loss=0.2482, simple_loss=0.3319, pruned_loss=0.08222, over 21268.00 frames. ], tot_loss[loss=0.2224, simple_loss=0.3084, pruned_loss=0.06819, over 4274798.87 frames. 
], batch size: 548, lr: 3.78e-03, grad_scale: 32.0 2023-06-25 12:42:17,992 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=1336032.0, ans=0.2 2023-06-25 12:43:34,141 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=1336212.0, ans=10.0 2023-06-25 12:43:37,057 INFO [train.py:996] (3/4) Epoch 8, batch 9250, loss[loss=0.236, simple_loss=0.3527, pruned_loss=0.05963, over 19712.00 frames. ], tot_loss[loss=0.2254, simple_loss=0.3088, pruned_loss=0.07102, over 4277690.70 frames. ], batch size: 702, lr: 3.78e-03, grad_scale: 32.0 2023-06-25 12:44:15,132 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=1336332.0, ans=0.0 2023-06-25 12:44:52,264 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.15 vs. limit=15.0 2023-06-25 12:45:21,451 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.801e+02 3.683e+02 5.339e+02 7.868e+02 1.539e+03, threshold=1.068e+03, percent-clipped=20.0 2023-06-25 12:45:25,482 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=1336512.0, ans=0.125 2023-06-25 12:45:28,187 INFO [train.py:996] (3/4) Epoch 8, batch 9300, loss[loss=0.1859, simple_loss=0.2474, pruned_loss=0.06219, over 21343.00 frames. ], tot_loss[loss=0.2228, simple_loss=0.3033, pruned_loss=0.07112, over 4271563.44 frames. ], batch size: 177, lr: 3.78e-03, grad_scale: 16.0 2023-06-25 12:45:41,323 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=1336572.0, ans=0.125 2023-06-25 12:46:04,703 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.11 vs. limit=10.0 2023-06-25 12:46:36,214 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=1336752.0, ans=0.125 2023-06-25 12:47:19,180 INFO [train.py:996] (3/4) Epoch 8, batch 9350, loss[loss=0.2302, simple_loss=0.3123, pruned_loss=0.07408, over 21484.00 frames. ], tot_loss[loss=0.2266, simple_loss=0.3087, pruned_loss=0.07224, over 4265397.94 frames. ], batch size: 211, lr: 3.78e-03, grad_scale: 16.0 2023-06-25 12:47:23,608 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=1336872.0, ans=0.2 2023-06-25 12:48:09,123 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1336992.0, ans=0.0 2023-06-25 12:49:02,854 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.027e+02 4.116e+02 5.791e+02 8.209e+02 2.175e+03, threshold=1.158e+03, percent-clipped=13.0 2023-06-25 12:49:10,230 INFO [train.py:996] (3/4) Epoch 8, batch 9400, loss[loss=0.2164, simple_loss=0.2865, pruned_loss=0.07317, over 21762.00 frames. ], tot_loss[loss=0.2267, simple_loss=0.3087, pruned_loss=0.07234, over 4273391.26 frames. 
], batch size: 124, lr: 3.78e-03, grad_scale: 16.0 2023-06-25 12:50:23,557 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=1337352.0, ans=0.0 2023-06-25 12:50:29,072 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1337352.0, ans=0.125 2023-06-25 12:50:48,768 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1337412.0, ans=0.125 2023-06-25 12:51:05,907 INFO [train.py:996] (3/4) Epoch 8, batch 9450, loss[loss=0.1741, simple_loss=0.244, pruned_loss=0.05208, over 21672.00 frames. ], tot_loss[loss=0.2221, simple_loss=0.301, pruned_loss=0.07159, over 4269780.83 frames. ], batch size: 282, lr: 3.78e-03, grad_scale: 16.0 2023-06-25 12:51:16,427 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=1337472.0, ans=0.0 2023-06-25 12:51:18,301 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1337472.0, ans=0.125 2023-06-25 12:51:46,618 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1337592.0, ans=0.0 2023-06-25 12:52:26,442 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=1337652.0, ans=0.2 2023-06-25 12:52:26,489 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=1337652.0, ans=0.0 2023-06-25 12:52:27,026 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.56 vs. limit=15.0 2023-06-25 12:52:41,864 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.789e+02 4.276e+02 5.565e+02 7.806e+02 1.820e+03, threshold=1.113e+03, percent-clipped=7.0 2023-06-25 12:52:44,232 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1337712.0, ans=0.125 2023-06-25 12:52:48,396 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.87 vs. limit=15.0 2023-06-25 12:52:48,835 INFO [train.py:996] (3/4) Epoch 8, batch 9500, loss[loss=0.1927, simple_loss=0.2606, pruned_loss=0.0624, over 22008.00 frames. ], tot_loss[loss=0.2158, simple_loss=0.2928, pruned_loss=0.0694, over 4260352.56 frames. ], batch size: 375, lr: 3.78e-03, grad_scale: 16.0 2023-06-25 12:53:02,383 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=1337772.0, ans=0.2 2023-06-25 12:53:07,572 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=1337772.0, ans=0.2 2023-06-25 12:53:14,141 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1337832.0, ans=0.125 2023-06-25 12:53:58,908 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=1337892.0, ans=0.0 2023-06-25 12:53:59,617 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.73 vs. 
limit=15.0 2023-06-25 12:54:43,654 INFO [train.py:996] (3/4) Epoch 8, batch 9550, loss[loss=0.2546, simple_loss=0.3253, pruned_loss=0.09199, over 21745.00 frames. ], tot_loss[loss=0.22, simple_loss=0.2974, pruned_loss=0.0713, over 4265971.73 frames. ], batch size: 441, lr: 3.78e-03, grad_scale: 16.0 2023-06-25 12:56:26,043 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.967e+02 4.048e+02 5.374e+02 8.215e+02 1.903e+03, threshold=1.075e+03, percent-clipped=10.0 2023-06-25 12:56:32,888 INFO [train.py:996] (3/4) Epoch 8, batch 9600, loss[loss=0.2158, simple_loss=0.2883, pruned_loss=0.0717, over 21686.00 frames. ], tot_loss[loss=0.2224, simple_loss=0.2993, pruned_loss=0.07278, over 4270978.14 frames. ], batch size: 263, lr: 3.78e-03, grad_scale: 32.0 2023-06-25 12:57:42,983 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1338552.0, ans=0.1 2023-06-25 12:57:43,006 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=1338552.0, ans=0.125 2023-06-25 12:57:48,989 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=1338552.0, ans=10.0 2023-06-25 12:58:10,690 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1338612.0, ans=0.0 2023-06-25 12:58:24,347 INFO [train.py:996] (3/4) Epoch 8, batch 9650, loss[loss=0.2161, simple_loss=0.2934, pruned_loss=0.06936, over 21723.00 frames. ], tot_loss[loss=0.2222, simple_loss=0.2992, pruned_loss=0.07261, over 4278088.34 frames. ], batch size: 298, lr: 3.78e-03, grad_scale: 32.0 2023-06-25 12:59:32,676 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1338852.0, ans=0.125 2023-06-25 12:59:37,258 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.46 vs. limit=15.0 2023-06-25 12:59:49,385 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.87 vs. limit=22.5 2023-06-25 13:00:07,427 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.726e+02 3.684e+02 4.580e+02 6.595e+02 1.807e+03, threshold=9.160e+02, percent-clipped=4.0 2023-06-25 13:00:20,071 INFO [train.py:996] (3/4) Epoch 8, batch 9700, loss[loss=0.236, simple_loss=0.3086, pruned_loss=0.0817, over 21766.00 frames. ], tot_loss[loss=0.2238, simple_loss=0.3024, pruned_loss=0.07259, over 4276623.40 frames. ], batch size: 414, lr: 3.78e-03, grad_scale: 32.0 2023-06-25 13:01:14,884 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=1339092.0, ans=0.2 2023-06-25 13:01:14,924 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1339092.0, ans=0.125 2023-06-25 13:02:02,386 INFO [train.py:996] (3/4) Epoch 8, batch 9750, loss[loss=0.2146, simple_loss=0.3328, pruned_loss=0.04824, over 20833.00 frames. ], tot_loss[loss=0.2194, simple_loss=0.2969, pruned_loss=0.07095, over 4271211.56 frames. 
], batch size: 608, lr: 3.77e-03, grad_scale: 32.0 2023-06-25 13:02:15,574 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 13:03:22,607 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=1339452.0, ans=0.125 2023-06-25 13:03:28,659 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1339512.0, ans=0.125 2023-06-25 13:03:42,354 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.775e+02 3.743e+02 5.532e+02 7.768e+02 2.224e+03, threshold=1.106e+03, percent-clipped=14.0 2023-06-25 13:03:49,293 INFO [train.py:996] (3/4) Epoch 8, batch 9800, loss[loss=0.2306, simple_loss=0.299, pruned_loss=0.08109, over 21882.00 frames. ], tot_loss[loss=0.2193, simple_loss=0.2962, pruned_loss=0.07126, over 4270929.78 frames. ], batch size: 371, lr: 3.77e-03, grad_scale: 32.0 2023-06-25 13:03:51,577 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-25 13:04:21,207 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1339632.0, ans=0.125 2023-06-25 13:04:21,223 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1339632.0, ans=0.125 2023-06-25 13:04:37,300 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=1339692.0, ans=0.0 2023-06-25 13:04:37,318 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=1339692.0, ans=0.125 2023-06-25 13:05:19,119 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1339812.0, ans=0.125 2023-06-25 13:05:37,669 INFO [train.py:996] (3/4) Epoch 8, batch 9850, loss[loss=0.1715, simple_loss=0.2435, pruned_loss=0.04972, over 21677.00 frames. ], tot_loss[loss=0.2175, simple_loss=0.2924, pruned_loss=0.07127, over 4272120.74 frames. ], batch size: 282, lr: 3.77e-03, grad_scale: 32.0 2023-06-25 13:05:49,076 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=1339872.0, ans=0.125 2023-06-25 13:06:15,614 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1339932.0, ans=0.125 2023-06-25 13:06:52,940 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.30 vs. limit=15.0 2023-06-25 13:07:13,708 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.94 vs. limit=15.0 2023-06-25 13:07:13,722 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.35 vs. 
limit=15.0 2023-06-25 13:07:19,911 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.843e+02 3.728e+02 4.692e+02 6.683e+02 1.521e+03, threshold=9.384e+02, percent-clipped=6.0 2023-06-25 13:07:23,840 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=1340112.0, ans=0.0 2023-06-25 13:07:26,587 INFO [train.py:996] (3/4) Epoch 8, batch 9900, loss[loss=0.2418, simple_loss=0.3177, pruned_loss=0.08296, over 21693.00 frames. ], tot_loss[loss=0.2148, simple_loss=0.2888, pruned_loss=0.07041, over 4265314.07 frames. ], batch size: 351, lr: 3.77e-03, grad_scale: 32.0 2023-06-25 13:07:54,173 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=2.89 vs. limit=12.0 2023-06-25 13:08:45,836 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1340352.0, ans=0.125 2023-06-25 13:09:10,362 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=1340412.0, ans=0.125 2023-06-25 13:09:14,603 INFO [train.py:996] (3/4) Epoch 8, batch 9950, loss[loss=0.2607, simple_loss=0.3445, pruned_loss=0.08838, over 19960.00 frames. ], tot_loss[loss=0.2194, simple_loss=0.2924, pruned_loss=0.07317, over 4261150.23 frames. ], batch size: 703, lr: 3.77e-03, grad_scale: 16.0 2023-06-25 13:09:21,186 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.32 vs. limit=22.5 2023-06-25 13:09:39,294 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=6.98 vs. limit=15.0 2023-06-25 13:09:40,424 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=1340532.0, ans=0.125 2023-06-25 13:09:58,241 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1340532.0, ans=0.1 2023-06-25 13:10:24,678 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=1340652.0, ans=0.125 2023-06-25 13:10:59,869 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.606e+02 3.700e+02 4.924e+02 7.179e+02 1.701e+03, threshold=9.849e+02, percent-clipped=16.0 2023-06-25 13:11:11,615 INFO [train.py:996] (3/4) Epoch 8, batch 10000, loss[loss=0.1783, simple_loss=0.2591, pruned_loss=0.04877, over 21741.00 frames. ], tot_loss[loss=0.2145, simple_loss=0.287, pruned_loss=0.07097, over 4265534.66 frames. ], batch size: 352, lr: 3.77e-03, grad_scale: 32.0 2023-06-25 13:12:30,291 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1340952.0, ans=0.1 2023-06-25 13:12:37,013 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer_ff2.min_abs, batch_count=1341012.0, ans=0.1 2023-06-25 13:13:02,329 INFO [train.py:996] (3/4) Epoch 8, batch 10050, loss[loss=0.1706, simple_loss=0.2597, pruned_loss=0.0407, over 16792.00 frames. ], tot_loss[loss=0.2162, simple_loss=0.2895, pruned_loss=0.07149, over 4270323.07 frames. 
], batch size: 60, lr: 3.77e-03, grad_scale: 32.0 2023-06-25 13:13:04,795 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=1341072.0, ans=0.05 2023-06-25 13:13:43,183 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1341132.0, ans=0.125 2023-06-25 13:13:45,320 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.80 vs. limit=22.5 2023-06-25 13:13:58,283 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-25 13:14:13,751 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=9.30 vs. limit=15.0 2023-06-25 13:14:21,757 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1341252.0, ans=0.1 2023-06-25 13:14:23,460 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1341252.0, ans=0.0 2023-06-25 13:14:25,210 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1341252.0, ans=0.125 2023-06-25 13:14:41,877 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1341312.0, ans=0.125 2023-06-25 13:14:55,226 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.655e+02 4.346e+02 5.951e+02 7.848e+02 1.633e+03, threshold=1.190e+03, percent-clipped=16.0 2023-06-25 13:14:58,305 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.84 vs. limit=10.0 2023-06-25 13:14:58,751 INFO [train.py:996] (3/4) Epoch 8, batch 10100, loss[loss=0.2004, simple_loss=0.2469, pruned_loss=0.07692, over 20211.00 frames. ], tot_loss[loss=0.2128, simple_loss=0.2863, pruned_loss=0.06963, over 4274868.74 frames. ], batch size: 703, lr: 3.77e-03, grad_scale: 16.0 2023-06-25 13:15:11,925 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1341372.0, ans=0.125 2023-06-25 13:16:24,528 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=1341612.0, ans=0.2 2023-06-25 13:16:48,269 INFO [train.py:996] (3/4) Epoch 8, batch 10150, loss[loss=0.2266, simple_loss=0.2981, pruned_loss=0.07757, over 21434.00 frames. ], tot_loss[loss=0.2191, simple_loss=0.2933, pruned_loss=0.07243, over 4271297.29 frames. ], batch size: 194, lr: 3.77e-03, grad_scale: 16.0 2023-06-25 13:17:51,711 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=1341852.0, ans=0.2 2023-06-25 13:18:25,832 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=1341912.0, ans=0.125 2023-06-25 13:18:38,988 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.326e+02 3.458e+02 4.384e+02 5.388e+02 1.096e+03, threshold=8.768e+02, percent-clipped=0.0 2023-06-25 13:18:42,761 INFO [train.py:996] (3/4) Epoch 8, batch 10200, loss[loss=0.1993, simple_loss=0.2843, pruned_loss=0.05712, over 21699.00 frames. 
], tot_loss[loss=0.2174, simple_loss=0.2934, pruned_loss=0.07074, over 4265184.24 frames. ], batch size: 298, lr: 3.77e-03, grad_scale: 16.0 2023-06-25 13:18:59,661 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=1342032.0, ans=0.025 2023-06-25 13:19:38,526 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1342092.0, ans=0.125 2023-06-25 13:19:53,737 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=1342152.0, ans=0.125 2023-06-25 13:20:34,613 INFO [train.py:996] (3/4) Epoch 8, batch 10250, loss[loss=0.2628, simple_loss=0.3432, pruned_loss=0.09124, over 21832.00 frames. ], tot_loss[loss=0.2134, simple_loss=0.2914, pruned_loss=0.06771, over 4259864.28 frames. ], batch size: 124, lr: 3.77e-03, grad_scale: 16.0 2023-06-25 13:20:49,743 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=1342272.0, ans=0.0 2023-06-25 13:22:23,169 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.001e+02 3.628e+02 5.027e+02 6.947e+02 1.354e+03, threshold=1.005e+03, percent-clipped=10.0 2023-06-25 13:22:26,723 INFO [train.py:996] (3/4) Epoch 8, batch 10300, loss[loss=0.2371, simple_loss=0.3237, pruned_loss=0.07526, over 21900.00 frames. ], tot_loss[loss=0.2148, simple_loss=0.2944, pruned_loss=0.06764, over 4262218.99 frames. ], batch size: 372, lr: 3.77e-03, grad_scale: 16.0 2023-06-25 13:22:29,140 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=1342572.0, ans=0.125 2023-06-25 13:23:03,251 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=1342632.0, ans=0.125 2023-06-25 13:23:37,866 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1342752.0, ans=0.0 2023-06-25 13:24:18,477 INFO [train.py:996] (3/4) Epoch 8, batch 10350, loss[loss=0.1851, simple_loss=0.2689, pruned_loss=0.05061, over 21835.00 frames. ], tot_loss[loss=0.2153, simple_loss=0.2957, pruned_loss=0.06746, over 4257862.89 frames. ], batch size: 317, lr: 3.77e-03, grad_scale: 16.0 2023-06-25 13:25:02,575 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.88 vs. limit=12.0 2023-06-25 13:25:19,089 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1342992.0, ans=0.125 2023-06-25 13:25:35,439 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1343052.0, ans=0.1 2023-06-25 13:25:43,379 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.13 vs. limit=15.0 2023-06-25 13:26:05,200 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.961e+02 4.465e+02 6.325e+02 1.027e+03 2.051e+03, threshold=1.265e+03, percent-clipped=26.0 2023-06-25 13:26:15,286 INFO [train.py:996] (3/4) Epoch 8, batch 10400, loss[loss=0.1746, simple_loss=0.2483, pruned_loss=0.05046, over 21402.00 frames. ], tot_loss[loss=0.211, simple_loss=0.2895, pruned_loss=0.0663, over 4257916.66 frames. 
], batch size: 194, lr: 3.77e-03, grad_scale: 32.0 2023-06-25 13:27:36,340 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.55 vs. limit=15.0 2023-06-25 13:27:37,859 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1343352.0, ans=0.1 2023-06-25 13:28:03,321 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1343412.0, ans=0.125 2023-06-25 13:28:06,154 INFO [train.py:996] (3/4) Epoch 8, batch 10450, loss[loss=0.2217, simple_loss=0.3065, pruned_loss=0.06849, over 21656.00 frames. ], tot_loss[loss=0.2146, simple_loss=0.2919, pruned_loss=0.06861, over 4266553.69 frames. ], batch size: 263, lr: 3.77e-03, grad_scale: 32.0 2023-06-25 13:28:12,062 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1343472.0, ans=0.125 2023-06-25 13:28:40,136 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.44 vs. limit=12.0 2023-06-25 13:28:43,454 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=1343532.0, ans=0.0 2023-06-25 13:29:06,048 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=1343592.0, ans=0.2 2023-06-25 13:29:25,065 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1343652.0, ans=0.125 2023-06-25 13:29:51,732 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1343712.0, ans=0.1 2023-06-25 13:29:52,716 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.857e+02 4.046e+02 6.081e+02 8.924e+02 1.860e+03, threshold=1.216e+03, percent-clipped=7.0 2023-06-25 13:29:54,306 INFO [train.py:996] (3/4) Epoch 8, batch 10500, loss[loss=0.2297, simple_loss=0.2862, pruned_loss=0.0866, over 21297.00 frames. ], tot_loss[loss=0.2136, simple_loss=0.2907, pruned_loss=0.06823, over 4271094.64 frames. ], batch size: 471, lr: 3.77e-03, grad_scale: 16.0 2023-06-25 13:30:29,257 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=12.59 vs. limit=15.0 2023-06-25 13:31:20,372 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=1344012.0, ans=0.125 2023-06-25 13:31:44,409 INFO [train.py:996] (3/4) Epoch 8, batch 10550, loss[loss=0.1972, simple_loss=0.2656, pruned_loss=0.06434, over 21199.00 frames. ], tot_loss[loss=0.2104, simple_loss=0.2865, pruned_loss=0.06719, over 4259081.15 frames. ], batch size: 548, lr: 3.77e-03, grad_scale: 16.0 2023-06-25 13:31:49,265 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.66 vs. 
limit=10.0 2023-06-25 13:32:57,466 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1344252.0, ans=0.0 2023-06-25 13:33:01,232 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=1344252.0, ans=10.0 2023-06-25 13:33:10,567 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=1344252.0, ans=0.0 2023-06-25 13:33:35,043 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.430e+02 3.879e+02 5.008e+02 7.044e+02 1.478e+03, threshold=1.002e+03, percent-clipped=2.0 2023-06-25 13:33:37,116 INFO [train.py:996] (3/4) Epoch 8, batch 10600, loss[loss=0.2467, simple_loss=0.3417, pruned_loss=0.07585, over 21624.00 frames. ], tot_loss[loss=0.2082, simple_loss=0.2837, pruned_loss=0.06634, over 4252100.04 frames. ], batch size: 414, lr: 3.77e-03, grad_scale: 16.0 2023-06-25 13:34:18,530 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.89 vs. limit=6.0 2023-06-25 13:34:22,032 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=14.17 vs. limit=22.5 2023-06-25 13:34:23,555 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1344432.0, ans=0.125 2023-06-25 13:35:09,562 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1344612.0, ans=0.125 2023-06-25 13:35:11,663 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1344612.0, ans=0.0 2023-06-25 13:35:32,270 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.35 vs. limit=6.0 2023-06-25 13:35:34,391 INFO [train.py:996] (3/4) Epoch 8, batch 10650, loss[loss=0.173, simple_loss=0.2596, pruned_loss=0.04316, over 21699.00 frames. ], tot_loss[loss=0.2078, simple_loss=0.2856, pruned_loss=0.06501, over 4252965.20 frames. ], batch size: 298, lr: 3.77e-03, grad_scale: 16.0 2023-06-25 13:36:24,006 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1344792.0, ans=0.125 2023-06-25 13:37:23,246 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.596e+02 3.773e+02 5.055e+02 6.605e+02 1.042e+03, threshold=1.011e+03, percent-clipped=1.0 2023-06-25 13:37:30,131 INFO [train.py:996] (3/4) Epoch 8, batch 10700, loss[loss=0.174, simple_loss=0.2485, pruned_loss=0.04977, over 21403.00 frames. ], tot_loss[loss=0.2084, simple_loss=0.2851, pruned_loss=0.06583, over 4251431.37 frames. 
], batch size: 194, lr: 3.77e-03, grad_scale: 16.0 2023-06-25 13:37:42,778 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=1344972.0, ans=0.125 2023-06-25 13:38:33,471 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=1345152.0, ans=0.125 2023-06-25 13:38:58,681 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1345212.0, ans=0.125 2023-06-25 13:39:22,490 INFO [train.py:996] (3/4) Epoch 8, batch 10750, loss[loss=0.2727, simple_loss=0.3689, pruned_loss=0.08825, over 21685.00 frames. ], tot_loss[loss=0.2175, simple_loss=0.2951, pruned_loss=0.06988, over 4258399.43 frames. ], batch size: 441, lr: 3.77e-03, grad_scale: 16.0 2023-06-25 13:39:26,757 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1345272.0, ans=0.125 2023-06-25 13:39:28,806 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1345272.0, ans=0.125 2023-06-25 13:39:58,838 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1345392.0, ans=0.0 2023-06-25 13:40:01,996 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1345392.0, ans=0.0 2023-06-25 13:40:47,123 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.48 vs. limit=22.5 2023-06-25 13:40:51,714 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1345512.0, ans=0.125 2023-06-25 13:41:05,928 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.724e+02 3.874e+02 4.652e+02 6.783e+02 1.933e+03, threshold=9.304e+02, percent-clipped=9.0 2023-06-25 13:41:08,308 INFO [train.py:996] (3/4) Epoch 8, batch 10800, loss[loss=0.2156, simple_loss=0.2943, pruned_loss=0.06839, over 19989.00 frames. ], tot_loss[loss=0.2208, simple_loss=0.3004, pruned_loss=0.07058, over 4262556.05 frames. ], batch size: 702, lr: 3.77e-03, grad_scale: 32.0 2023-06-25 13:41:52,137 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1345692.0, ans=0.125 2023-06-25 13:42:53,467 INFO [train.py:996] (3/4) Epoch 8, batch 10850, loss[loss=0.1901, simple_loss=0.269, pruned_loss=0.0556, over 21703.00 frames. ], tot_loss[loss=0.2212, simple_loss=0.3003, pruned_loss=0.071, over 4268912.13 frames. ], batch size: 333, lr: 3.77e-03, grad_scale: 16.0 2023-06-25 13:43:31,205 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=1345932.0, ans=0.125 2023-06-25 13:44:43,296 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.620e+02 4.154e+02 5.827e+02 8.227e+02 1.341e+03, threshold=1.165e+03, percent-clipped=17.0 2023-06-25 13:44:43,334 INFO [train.py:996] (3/4) Epoch 8, batch 10900, loss[loss=0.1956, simple_loss=0.2774, pruned_loss=0.05688, over 21398.00 frames. ], tot_loss[loss=0.2159, simple_loss=0.2934, pruned_loss=0.06923, over 4246799.53 frames. 
], batch size: 194, lr: 3.77e-03, grad_scale: 16.0 2023-06-25 13:44:49,376 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=1346172.0, ans=0.5 2023-06-25 13:44:56,448 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=1346172.0, ans=0.2 2023-06-25 13:44:59,774 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1346232.0, ans=0.125 2023-06-25 13:45:30,567 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.98 vs. limit=10.0 2023-06-25 13:45:33,688 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=1346292.0, ans=0.95 2023-06-25 13:46:28,365 INFO [train.py:996] (3/4) Epoch 8, batch 10950, loss[loss=0.1886, simple_loss=0.2621, pruned_loss=0.05749, over 21576.00 frames. ], tot_loss[loss=0.2126, simple_loss=0.2899, pruned_loss=0.0676, over 4248725.09 frames. ], batch size: 263, lr: 3.76e-03, grad_scale: 16.0 2023-06-25 13:47:15,773 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1346592.0, ans=0.1 2023-06-25 13:47:29,729 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=1346592.0, ans=0.125 2023-06-25 13:47:31,693 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1346592.0, ans=0.125 2023-06-25 13:47:35,042 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1346652.0, ans=0.1 2023-06-25 13:47:41,544 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1346652.0, ans=0.1 2023-06-25 13:47:48,571 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=1346652.0, ans=0.0 2023-06-25 13:48:09,422 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.08 vs. limit=15.0 2023-06-25 13:48:10,257 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.690e+02 3.805e+02 5.172e+02 7.672e+02 1.562e+03, threshold=1.034e+03, percent-clipped=4.0 2023-06-25 13:48:10,288 INFO [train.py:996] (3/4) Epoch 8, batch 11000, loss[loss=0.2508, simple_loss=0.3122, pruned_loss=0.09471, over 21793.00 frames. ], tot_loss[loss=0.2124, simple_loss=0.2878, pruned_loss=0.06843, over 4263581.26 frames. 
], batch size: 441, lr: 3.76e-03, grad_scale: 16.0 2023-06-25 13:48:16,103 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.min_positive, batch_count=1346772.0, ans=0.05 2023-06-25 13:48:19,436 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=1346772.0, ans=10.0 2023-06-25 13:48:20,883 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=1346772.0, ans=0.125 2023-06-25 13:48:52,766 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1346832.0, ans=0.1 2023-06-25 13:48:56,089 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=1346892.0, ans=0.2 2023-06-25 13:48:59,511 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=1346892.0, ans=0.2 2023-06-25 13:49:59,345 INFO [train.py:996] (3/4) Epoch 8, batch 11050, loss[loss=0.1985, simple_loss=0.2616, pruned_loss=0.06768, over 21756.00 frames. ], tot_loss[loss=0.2124, simple_loss=0.2858, pruned_loss=0.06952, over 4273863.84 frames. ], batch size: 300, lr: 3.76e-03, grad_scale: 16.0 2023-06-25 13:50:14,479 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1347072.0, ans=0.125 2023-06-25 13:50:19,603 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1347132.0, ans=0.125 2023-06-25 13:51:04,494 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=1347192.0, ans=0.125 2023-06-25 13:51:49,987 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.876e+02 3.834e+02 4.608e+02 6.864e+02 1.206e+03, threshold=9.217e+02, percent-clipped=3.0 2023-06-25 13:51:50,018 INFO [train.py:996] (3/4) Epoch 8, batch 11100, loss[loss=0.2053, simple_loss=0.2829, pruned_loss=0.06387, over 21756.00 frames. ], tot_loss[loss=0.2113, simple_loss=0.2853, pruned_loss=0.06862, over 4279816.56 frames. ], batch size: 351, lr: 3.76e-03, grad_scale: 16.0 2023-06-25 13:51:53,766 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=1347372.0, ans=0.0 2023-06-25 13:52:03,002 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=1347372.0, ans=0.025 2023-06-25 13:52:03,590 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.13 vs. limit=15.0 2023-06-25 13:52:15,066 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1347432.0, ans=0.125 2023-06-25 13:53:05,745 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.26 vs. limit=15.0 2023-06-25 13:53:22,612 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1347612.0, ans=0.125 2023-06-25 13:53:37,248 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=6.39 vs. 
limit=10.0 2023-06-25 13:53:39,295 INFO [train.py:996] (3/4) Epoch 8, batch 11150, loss[loss=0.2195, simple_loss=0.3012, pruned_loss=0.06888, over 21306.00 frames. ], tot_loss[loss=0.2111, simple_loss=0.2846, pruned_loss=0.06882, over 4270396.37 frames. ], batch size: 144, lr: 3.76e-03, grad_scale: 16.0 2023-06-25 13:53:48,804 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=1347672.0, ans=0.0 2023-06-25 13:54:13,360 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1347732.0, ans=0.1 2023-06-25 13:54:51,451 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1347852.0, ans=0.125 2023-06-25 13:55:14,988 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1347912.0, ans=0.125 2023-06-25 13:55:16,229 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=1347912.0, ans=0.125 2023-06-25 13:55:21,340 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-25 13:55:21,409 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=1347972.0, ans=0.125 2023-06-25 13:55:22,808 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.920e+02 3.496e+02 4.428e+02 6.433e+02 1.139e+03, threshold=8.857e+02, percent-clipped=2.0 2023-06-25 13:55:22,839 INFO [train.py:996] (3/4) Epoch 8, batch 11200, loss[loss=0.2004, simple_loss=0.2643, pruned_loss=0.0682, over 21873.00 frames. ], tot_loss[loss=0.2101, simple_loss=0.2835, pruned_loss=0.06836, over 4270750.67 frames. ], batch size: 373, lr: 3.76e-03, grad_scale: 32.0 2023-06-25 13:56:15,249 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1348092.0, ans=0.1 2023-06-25 13:56:23,603 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1348092.0, ans=0.0 2023-06-25 13:57:00,477 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=1348212.0, ans=0.2 2023-06-25 13:57:10,452 INFO [train.py:996] (3/4) Epoch 8, batch 11250, loss[loss=0.2194, simple_loss=0.3006, pruned_loss=0.06911, over 21774.00 frames. ], tot_loss[loss=0.2098, simple_loss=0.2833, pruned_loss=0.06811, over 4262290.00 frames. ], batch size: 351, lr: 3.76e-03, grad_scale: 32.0 2023-06-25 13:58:55,217 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1348512.0, ans=0.125 2023-06-25 13:58:59,631 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.719e+02 3.512e+02 4.278e+02 5.867e+02 1.075e+03, threshold=8.556e+02, percent-clipped=3.0 2023-06-25 13:58:59,671 INFO [train.py:996] (3/4) Epoch 8, batch 11300, loss[loss=0.1991, simple_loss=0.2874, pruned_loss=0.05536, over 21753.00 frames. ], tot_loss[loss=0.2101, simple_loss=0.2837, pruned_loss=0.06829, over 4269175.24 frames. ], batch size: 391, lr: 3.76e-03, grad_scale: 32.0 2023-06-25 13:59:50,060 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=12.77 vs. 
limit=22.5 2023-06-25 13:59:56,245 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 14:00:36,139 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1348812.0, ans=0.1 2023-06-25 14:00:40,164 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=1348812.0, ans=0.2 2023-06-25 14:00:49,637 INFO [train.py:996] (3/4) Epoch 8, batch 11350, loss[loss=0.2491, simple_loss=0.3279, pruned_loss=0.08513, over 21736.00 frames. ], tot_loss[loss=0.2101, simple_loss=0.2839, pruned_loss=0.06816, over 4262356.30 frames. ], batch size: 332, lr: 3.76e-03, grad_scale: 16.0 2023-06-25 14:00:51,714 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1348872.0, ans=0.1 2023-06-25 14:02:19,075 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=1349052.0, ans=0.125 2023-06-25 14:02:31,914 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.42 vs. limit=15.0 2023-06-25 14:02:38,245 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=1349112.0, ans=0.2 2023-06-25 14:02:41,718 INFO [train.py:996] (3/4) Epoch 8, batch 11400, loss[loss=0.1851, simple_loss=0.283, pruned_loss=0.04359, over 20706.00 frames. ], tot_loss[loss=0.2164, simple_loss=0.2911, pruned_loss=0.07086, over 4268379.61 frames. ], batch size: 607, lr: 3.76e-03, grad_scale: 16.0 2023-06-25 14:02:42,451 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1349172.0, ans=0.0 2023-06-25 14:02:43,594 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.820e+02 3.968e+02 4.967e+02 6.707e+02 2.156e+03, threshold=9.935e+02, percent-clipped=13.0 2023-06-25 14:03:06,988 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1349172.0, ans=0.125 2023-06-25 14:03:33,336 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1349292.0, ans=0.125 2023-06-25 14:03:56,338 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1349352.0, ans=0.125 2023-06-25 14:04:36,640 INFO [train.py:996] (3/4) Epoch 8, batch 11450, loss[loss=0.2125, simple_loss=0.2942, pruned_loss=0.06533, over 21599.00 frames. ], tot_loss[loss=0.2154, simple_loss=0.2915, pruned_loss=0.06966, over 4268562.32 frames. ], batch size: 263, lr: 3.76e-03, grad_scale: 16.0 2023-06-25 14:04:37,506 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1349472.0, ans=0.125 2023-06-25 14:06:04,569 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=6.63 vs. 
limit=12.0 2023-06-25 14:06:07,419 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=1349712.0, ans=0.0 2023-06-25 14:06:09,795 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=5.88 vs. limit=10.0 2023-06-25 14:06:14,376 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1349712.0, ans=0.0 2023-06-25 14:06:33,204 INFO [train.py:996] (3/4) Epoch 8, batch 11500, loss[loss=0.2158, simple_loss=0.3175, pruned_loss=0.05703, over 21858.00 frames. ], tot_loss[loss=0.2178, simple_loss=0.2943, pruned_loss=0.07063, over 4266129.81 frames. ], batch size: 371, lr: 3.76e-03, grad_scale: 16.0 2023-06-25 14:06:34,722 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.573e+02 4.073e+02 4.904e+02 7.356e+02 1.531e+03, threshold=9.808e+02, percent-clipped=13.0 2023-06-25 14:07:05,885 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=1349832.0, ans=0.04949747468305833 2023-06-25 14:07:22,403 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1349892.0, ans=0.125 2023-06-25 14:07:31,528 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=2.88 vs. limit=12.0 2023-06-25 14:07:32,930 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1349892.0, ans=0.1 2023-06-25 14:07:39,157 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1349952.0, ans=0.0 2023-06-25 14:08:30,995 INFO [train.py:996] (3/4) Epoch 8, batch 11550, loss[loss=0.2691, simple_loss=0.3748, pruned_loss=0.0817, over 21748.00 frames. ], tot_loss[loss=0.2212, simple_loss=0.3003, pruned_loss=0.07108, over 4267753.76 frames. ], batch size: 351, lr: 3.76e-03, grad_scale: 16.0 2023-06-25 14:08:43,106 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=1350072.0, ans=0.0 2023-06-25 14:08:54,076 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=1350132.0, ans=0.125 2023-06-25 14:09:15,142 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1350192.0, ans=0.0 2023-06-25 14:09:22,079 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=1350192.0, ans=0.0 2023-06-25 14:09:22,755 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=7.84 vs. limit=15.0 2023-06-25 14:09:55,082 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 14:10:22,638 INFO [train.py:996] (3/4) Epoch 8, batch 11600, loss[loss=0.2315, simple_loss=0.3296, pruned_loss=0.0667, over 21647.00 frames. ], tot_loss[loss=0.2288, simple_loss=0.3123, pruned_loss=0.07267, over 4270677.17 frames. 
], batch size: 263, lr: 3.76e-03, grad_scale: 32.0 2023-06-25 14:10:24,364 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.914e+02 4.338e+02 5.534e+02 7.509e+02 2.145e+03, threshold=1.107e+03, percent-clipped=20.0 2023-06-25 14:10:32,077 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=1350372.0, ans=0.2 2023-06-25 14:10:47,140 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=2.93 vs. limit=15.0 2023-06-25 14:10:48,432 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=1350432.0, ans=0.0 2023-06-25 14:11:26,727 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=1350552.0, ans=0.0 2023-06-25 14:11:56,974 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=1350612.0, ans=0.125 2023-06-25 14:12:02,538 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.80 vs. limit=12.0 2023-06-25 14:12:12,223 INFO [train.py:996] (3/4) Epoch 8, batch 11650, loss[loss=0.2818, simple_loss=0.4054, pruned_loss=0.07914, over 21189.00 frames. ], tot_loss[loss=0.2321, simple_loss=0.3184, pruned_loss=0.07291, over 4268859.64 frames. ], batch size: 549, lr: 3.76e-03, grad_scale: 32.0 2023-06-25 14:12:25,556 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.12 vs. limit=15.0 2023-06-25 14:12:39,542 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=18.02 vs. limit=22.5 2023-06-25 14:13:25,972 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=1350852.0, ans=0.2 2023-06-25 14:13:37,260 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 14:13:55,107 INFO [train.py:996] (3/4) Epoch 8, batch 11700, loss[loss=0.1779, simple_loss=0.24, pruned_loss=0.05796, over 21606.00 frames. ], tot_loss[loss=0.2268, simple_loss=0.3099, pruned_loss=0.07192, over 4274750.73 frames. 
], batch size: 231, lr: 3.76e-03, grad_scale: 16.0 2023-06-25 14:13:58,338 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.769e+02 3.697e+02 5.318e+02 8.205e+02 1.649e+03, threshold=1.064e+03, percent-clipped=10.0 2023-06-25 14:13:58,938 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-25 14:14:06,171 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=1350972.0, ans=0.2 2023-06-25 14:15:14,582 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=1351152.0, ans=0.0 2023-06-25 14:15:25,514 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1351212.0, ans=0.125 2023-06-25 14:15:27,100 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1351212.0, ans=0.125 2023-06-25 14:15:43,629 INFO [train.py:996] (3/4) Epoch 8, batch 11750, loss[loss=0.2131, simple_loss=0.2861, pruned_loss=0.07006, over 21579.00 frames. ], tot_loss[loss=0.2218, simple_loss=0.3005, pruned_loss=0.07153, over 4271510.56 frames. ], batch size: 263, lr: 3.76e-03, grad_scale: 16.0 2023-06-25 14:15:54,507 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=14.30 vs. limit=22.5 2023-06-25 14:15:57,941 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.76 vs. limit=22.5 2023-06-25 14:16:02,458 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=1351272.0, ans=0.05 2023-06-25 14:16:13,118 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1351332.0, ans=0.125 2023-06-25 14:16:31,292 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=1351392.0, ans=0.0 2023-06-25 14:16:51,699 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=1351392.0, ans=0.125 2023-06-25 14:17:32,408 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.26 vs. limit=22.5 2023-06-25 14:17:40,825 INFO [train.py:996] (3/4) Epoch 8, batch 11800, loss[loss=0.2997, simple_loss=0.3827, pruned_loss=0.1083, over 21406.00 frames. ], tot_loss[loss=0.2249, simple_loss=0.3032, pruned_loss=0.07329, over 4273362.30 frames. ], batch size: 507, lr: 3.76e-03, grad_scale: 16.0 2023-06-25 14:17:44,262 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.704e+02 3.917e+02 5.538e+02 7.967e+02 1.804e+03, threshold=1.108e+03, percent-clipped=14.0 2023-06-25 14:17:53,642 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer_ff2.min_abs, batch_count=1351572.0, ans=0.1 2023-06-25 14:18:19,915 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1351692.0, ans=0.1 2023-06-25 14:19:30,724 INFO [train.py:996] (3/4) Epoch 8, batch 11850, loss[loss=0.2036, simple_loss=0.2963, pruned_loss=0.05547, over 21595.00 frames. 
], tot_loss[loss=0.2243, simple_loss=0.3042, pruned_loss=0.07222, over 4275251.96 frames. ], batch size: 263, lr: 3.76e-03, grad_scale: 16.0 2023-06-25 14:19:34,845 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1351872.0, ans=0.125 2023-06-25 14:19:39,876 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=1351872.0, ans=0.0 2023-06-25 14:20:06,035 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=1351932.0, ans=0.04949747468305833 2023-06-25 14:20:06,123 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1351932.0, ans=0.125 2023-06-25 14:20:09,255 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 14:21:22,239 INFO [train.py:996] (3/4) Epoch 8, batch 11900, loss[loss=0.2048, simple_loss=0.2954, pruned_loss=0.05708, over 21570.00 frames. ], tot_loss[loss=0.2219, simple_loss=0.3038, pruned_loss=0.06999, over 4277075.03 frames. ], batch size: 389, lr: 3.76e-03, grad_scale: 16.0 2023-06-25 14:21:25,801 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.770e+02 3.589e+02 4.714e+02 6.474e+02 1.333e+03, threshold=9.428e+02, percent-clipped=3.0 2023-06-25 14:21:27,243 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.28 vs. limit=15.0 2023-06-25 14:21:32,374 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=1352172.0, ans=0.035 2023-06-25 14:21:44,956 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=1352232.0, ans=0.1 2023-06-25 14:22:27,164 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1352292.0, ans=0.1 2023-06-25 14:22:52,590 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1352352.0, ans=0.125 2023-06-25 14:23:16,573 INFO [train.py:996] (3/4) Epoch 8, batch 11950, loss[loss=0.1876, simple_loss=0.2619, pruned_loss=0.05665, over 21823.00 frames. ], tot_loss[loss=0.2178, simple_loss=0.3018, pruned_loss=0.06687, over 4275437.08 frames. ], batch size: 102, lr: 3.76e-03, grad_scale: 16.0 2023-06-25 14:24:15,916 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=1352592.0, ans=0.125 2023-06-25 14:25:06,498 INFO [train.py:996] (3/4) Epoch 8, batch 12000, loss[loss=0.2178, simple_loss=0.2769, pruned_loss=0.07935, over 21227.00 frames. ], tot_loss[loss=0.2134, simple_loss=0.2955, pruned_loss=0.06561, over 4270834.83 frames. ], batch size: 160, lr: 3.76e-03, grad_scale: 32.0 2023-06-25 14:25:06,499 INFO [train.py:1019] (3/4) Computing validation loss 2023-06-25 14:25:31,285 INFO [train.py:1028] (3/4) Epoch 8, validation: loss=0.2626, simple_loss=0.3537, pruned_loss=0.08577, over 1796401.00 frames. 
2023-06-25 14:25:31,286 INFO [train.py:1029] (3/4) Maximum memory allocated so far is 23654MB 2023-06-25 14:25:32,046 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1352772.0, ans=0.125 2023-06-25 14:25:33,909 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=1352772.0, ans=0.0 2023-06-25 14:25:33,981 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1352772.0, ans=0.1 2023-06-25 14:25:34,819 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.287e+02 3.581e+02 4.444e+02 6.606e+02 1.302e+03, threshold=8.887e+02, percent-clipped=8.0 2023-06-25 14:25:38,511 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=1352772.0, ans=0.2 2023-06-25 14:26:18,404 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1352892.0, ans=0.125 2023-06-25 14:26:32,399 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=1352952.0, ans=0.0 2023-06-25 14:27:03,598 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=1353012.0, ans=0.0 2023-06-25 14:27:08,722 INFO [train.py:996] (3/4) Epoch 8, batch 12050, loss[loss=0.2251, simple_loss=0.294, pruned_loss=0.0781, over 21399.00 frames. ], tot_loss[loss=0.2143, simple_loss=0.2939, pruned_loss=0.06733, over 4275651.43 frames. ], batch size: 159, lr: 3.76e-03, grad_scale: 32.0 2023-06-25 14:27:23,467 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1353072.0, ans=0.0 2023-06-25 14:27:46,584 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1353132.0, ans=0.0 2023-06-25 14:28:01,949 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.88 vs. limit=15.0 2023-06-25 14:29:10,832 INFO [train.py:996] (3/4) Epoch 8, batch 12100, loss[loss=0.2849, simple_loss=0.3534, pruned_loss=0.1081, over 21404.00 frames. ], tot_loss[loss=0.2223, simple_loss=0.3022, pruned_loss=0.07119, over 4271742.08 frames. 
], batch size: 507, lr: 3.76e-03, grad_scale: 32.0 2023-06-25 14:29:14,309 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.733e+02 4.401e+02 6.036e+02 8.453e+02 2.254e+03, threshold=1.207e+03, percent-clipped=23.0 2023-06-25 14:29:18,885 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1353372.0, ans=0.125 2023-06-25 14:29:34,187 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten.whitening_limit, batch_count=1353432.0, ans=15.0 2023-06-25 14:30:06,049 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=1353492.0, ans=0.125 2023-06-25 14:30:18,963 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1353552.0, ans=0.125 2023-06-25 14:30:37,289 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=1353552.0, ans=0.05 2023-06-25 14:31:09,991 INFO [train.py:996] (3/4) Epoch 8, batch 12150, loss[loss=0.2566, simple_loss=0.3566, pruned_loss=0.07834, over 21642.00 frames. ], tot_loss[loss=0.2243, simple_loss=0.3062, pruned_loss=0.07122, over 4271421.65 frames. ], batch size: 441, lr: 3.75e-03, grad_scale: 32.0 2023-06-25 14:31:14,567 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=1353672.0, ans=0.125 2023-06-25 14:31:14,656 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1353672.0, ans=0.1 2023-06-25 14:31:36,483 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=1353732.0, ans=0.125 2023-06-25 14:32:48,285 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=1353912.0, ans=0.07 2023-06-25 14:32:59,806 INFO [train.py:996] (3/4) Epoch 8, batch 12200, loss[loss=0.212, simple_loss=0.2712, pruned_loss=0.0764, over 21654.00 frames. ], tot_loss[loss=0.2212, simple_loss=0.3015, pruned_loss=0.07051, over 4272164.05 frames. ], batch size: 333, lr: 3.75e-03, grad_scale: 32.0 2023-06-25 14:33:03,415 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.861e+02 3.926e+02 5.745e+02 7.853e+02 1.417e+03, threshold=1.149e+03, percent-clipped=2.0 2023-06-25 14:34:06,794 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1354152.0, ans=0.125 2023-06-25 14:34:24,360 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.26 vs. limit=15.0 2023-06-25 14:34:37,588 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1354212.0, ans=0.125 2023-06-25 14:34:37,596 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=1354212.0, ans=0.04949747468305833 2023-06-25 14:34:46,554 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=1354272.0, ans=0.0 2023-06-25 14:34:47,604 INFO [train.py:996] (3/4) Epoch 8, batch 12250, loss[loss=0.1631, simple_loss=0.243, pruned_loss=0.04158, over 21584.00 frames. 
], tot_loss[loss=0.2143, simple_loss=0.2933, pruned_loss=0.06762, over 4269211.38 frames. ], batch size: 263, lr: 3.75e-03, grad_scale: 16.0 2023-06-25 14:35:14,687 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=1354332.0, ans=0.125 2023-06-25 14:36:16,136 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=1354512.0, ans=0.0 2023-06-25 14:36:16,212 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=1354512.0, ans=0.2 2023-06-25 14:36:36,590 INFO [train.py:996] (3/4) Epoch 8, batch 12300, loss[loss=0.1968, simple_loss=0.2866, pruned_loss=0.0535, over 21881.00 frames. ], tot_loss[loss=0.2048, simple_loss=0.2853, pruned_loss=0.06219, over 4276348.35 frames. ], batch size: 316, lr: 3.75e-03, grad_scale: 16.0 2023-06-25 14:36:37,760 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.16 vs. limit=12.0 2023-06-25 14:36:41,788 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.141e+02 3.509e+02 4.835e+02 7.096e+02 1.534e+03, threshold=9.669e+02, percent-clipped=2.0 2023-06-25 14:38:15,861 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1354812.0, ans=0.0 2023-06-25 14:38:25,379 INFO [train.py:996] (3/4) Epoch 8, batch 12350, loss[loss=0.2339, simple_loss=0.3283, pruned_loss=0.06974, over 21735.00 frames. ], tot_loss[loss=0.2088, simple_loss=0.2905, pruned_loss=0.06356, over 4280436.91 frames. ], batch size: 332, lr: 3.75e-03, grad_scale: 16.0 2023-06-25 14:38:55,916 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.73 vs. limit=22.5 2023-06-25 14:39:21,423 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=1355052.0, ans=0.2 2023-06-25 14:39:29,976 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=1355052.0, ans=10.0 2023-06-25 14:39:45,128 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1355112.0, ans=0.1 2023-06-25 14:39:46,912 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1355112.0, ans=0.125 2023-06-25 14:40:09,754 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=1355112.0, ans=0.2 2023-06-25 14:40:12,700 INFO [train.py:996] (3/4) Epoch 8, batch 12400, loss[loss=0.2224, simple_loss=0.291, pruned_loss=0.07688, over 21814.00 frames. ], tot_loss[loss=0.2128, simple_loss=0.2927, pruned_loss=0.06646, over 4285920.67 frames. ], batch size: 391, lr: 3.75e-03, grad_scale: 32.0 2023-06-25 14:40:17,819 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.711e+02 4.388e+02 6.020e+02 7.604e+02 1.312e+03, threshold=1.204e+03, percent-clipped=10.0 2023-06-25 14:40:43,538 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.14 vs. 
limit=22.5 2023-06-25 14:41:08,540 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=1355292.0, ans=0.0 2023-06-25 14:41:15,781 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1355352.0, ans=0.0 2023-06-25 14:42:04,095 INFO [train.py:996] (3/4) Epoch 8, batch 12450, loss[loss=0.2612, simple_loss=0.3322, pruned_loss=0.09511, over 21533.00 frames. ], tot_loss[loss=0.217, simple_loss=0.2951, pruned_loss=0.06943, over 4283381.34 frames. ], batch size: 414, lr: 3.75e-03, grad_scale: 16.0 2023-06-25 14:42:08,346 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1355472.0, ans=0.125 2023-06-25 14:42:13,668 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1355472.0, ans=0.125 2023-06-25 14:42:25,269 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1355532.0, ans=0.0 2023-06-25 14:43:19,772 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=1355652.0, ans=0.2 2023-06-25 14:43:44,976 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.99 vs. limit=15.0 2023-06-25 14:43:55,912 INFO [train.py:996] (3/4) Epoch 8, batch 12500, loss[loss=0.2549, simple_loss=0.3547, pruned_loss=0.07754, over 21891.00 frames. ], tot_loss[loss=0.2264, simple_loss=0.307, pruned_loss=0.07289, over 4288084.17 frames. ], batch size: 372, lr: 3.75e-03, grad_scale: 16.0 2023-06-25 14:44:02,972 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.309e+02 4.292e+02 5.906e+02 9.269e+02 3.047e+03, threshold=1.181e+03, percent-clipped=14.0 2023-06-25 14:44:40,043 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1355892.0, ans=0.125 2023-06-25 14:44:41,546 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=1355892.0, ans=0.125 2023-06-25 14:44:49,207 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1355892.0, ans=0.125 2023-06-25 14:45:27,878 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1356012.0, ans=0.1 2023-06-25 14:45:47,029 INFO [train.py:996] (3/4) Epoch 8, batch 12550, loss[loss=0.2266, simple_loss=0.3107, pruned_loss=0.07123, over 21645.00 frames. ], tot_loss[loss=0.2309, simple_loss=0.3105, pruned_loss=0.07562, over 4280824.18 frames. ], batch size: 263, lr: 3.75e-03, grad_scale: 16.0 2023-06-25 14:45:56,078 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.29 vs. 
limit=22.5 2023-06-25 14:46:11,562 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1356072.0, ans=0.0 2023-06-25 14:46:36,246 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1356132.0, ans=0.0 2023-06-25 14:46:44,048 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=8.79 vs. limit=10.0 2023-06-25 14:46:52,345 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1356192.0, ans=0.125 2023-06-25 14:47:24,605 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=2.83 vs. limit=15.0 2023-06-25 14:47:33,393 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.72 vs. limit=15.0 2023-06-25 14:47:42,214 INFO [train.py:996] (3/4) Epoch 8, batch 12600, loss[loss=0.2265, simple_loss=0.3511, pruned_loss=0.05093, over 20791.00 frames. ], tot_loss[loss=0.2287, simple_loss=0.3098, pruned_loss=0.0738, over 4272122.93 frames. ], batch size: 608, lr: 3.75e-03, grad_scale: 16.0 2023-06-25 14:47:44,650 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=1356372.0, ans=0.0 2023-06-25 14:47:48,695 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.799e+02 4.195e+02 5.786e+02 8.769e+02 1.751e+03, threshold=1.157e+03, percent-clipped=8.0 2023-06-25 14:48:41,917 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.32 vs. limit=12.0 2023-06-25 14:49:23,541 INFO [train.py:996] (3/4) Epoch 8, batch 12650, loss[loss=0.2173, simple_loss=0.295, pruned_loss=0.06984, over 21807.00 frames. ], tot_loss[loss=0.2216, simple_loss=0.3026, pruned_loss=0.07027, over 4281473.33 frames. ], batch size: 112, lr: 3.75e-03, grad_scale: 16.0 2023-06-25 14:50:07,367 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1356732.0, ans=0.1 2023-06-25 14:50:48,313 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=1356852.0, ans=0.0 2023-06-25 14:51:01,211 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1356912.0, ans=0.0 2023-06-25 14:51:19,758 INFO [train.py:996] (3/4) Epoch 8, batch 12700, loss[loss=0.3012, simple_loss=0.3577, pruned_loss=0.1224, over 21382.00 frames. ], tot_loss[loss=0.2228, simple_loss=0.3014, pruned_loss=0.07212, over 4283708.74 frames. 
], batch size: 507, lr: 3.75e-03, grad_scale: 16.0 2023-06-25 14:51:29,424 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=1356972.0, ans=0.0 2023-06-25 14:51:29,439 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1356972.0, ans=0.0 2023-06-25 14:51:32,483 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.627e+02 4.265e+02 5.595e+02 7.381e+02 1.572e+03, threshold=1.119e+03, percent-clipped=3.0 2023-06-25 14:52:19,759 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1357092.0, ans=0.0 2023-06-25 14:52:29,905 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=1357152.0, ans=0.2 2023-06-25 14:52:59,793 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 14:53:00,352 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten.whitening_limit, batch_count=1357212.0, ans=15.0 2023-06-25 14:53:02,632 INFO [train.py:996] (3/4) Epoch 8, batch 12750, loss[loss=0.2321, simple_loss=0.3089, pruned_loss=0.0776, over 21772.00 frames. ], tot_loss[loss=0.2229, simple_loss=0.303, pruned_loss=0.0714, over 4287787.34 frames. ], batch size: 414, lr: 3.75e-03, grad_scale: 16.0 2023-06-25 14:53:26,669 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.02 vs. limit=15.0 2023-06-25 14:53:57,339 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1357392.0, ans=0.0 2023-06-25 14:54:10,036 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1357452.0, ans=0.0 2023-06-25 14:54:34,681 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=1357512.0, ans=0.0 2023-06-25 14:54:57,145 INFO [train.py:996] (3/4) Epoch 8, batch 12800, loss[loss=0.2226, simple_loss=0.305, pruned_loss=0.07015, over 21386.00 frames. ], tot_loss[loss=0.2243, simple_loss=0.3033, pruned_loss=0.0726, over 4289209.42 frames. ], batch size: 176, lr: 3.75e-03, grad_scale: 32.0 2023-06-25 14:55:04,029 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.881e+02 3.698e+02 4.519e+02 5.409e+02 8.581e+02, threshold=9.039e+02, percent-clipped=0.0 2023-06-25 14:55:39,052 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-25 14:56:35,443 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1357812.0, ans=0.125 2023-06-25 14:56:47,890 INFO [train.py:996] (3/4) Epoch 8, batch 12850, loss[loss=0.2162, simple_loss=0.3218, pruned_loss=0.05526, over 19970.00 frames. ], tot_loss[loss=0.226, simple_loss=0.3047, pruned_loss=0.07361, over 4287399.39 frames. ], batch size: 703, lr: 3.75e-03, grad_scale: 32.0 2023-06-25 14:57:12,062 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=1357932.0, ans=0.125 2023-06-25 14:57:17,721 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.40 vs. 
limit=12.0 2023-06-25 14:58:40,148 INFO [train.py:996] (3/4) Epoch 8, batch 12900, loss[loss=0.1946, simple_loss=0.2666, pruned_loss=0.06132, over 21195.00 frames. ], tot_loss[loss=0.2234, simple_loss=0.3048, pruned_loss=0.07107, over 4279884.22 frames. ], batch size: 176, lr: 3.75e-03, grad_scale: 32.0 2023-06-25 14:58:45,827 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1358172.0, ans=0.1 2023-06-25 14:58:47,393 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.654e+02 3.588e+02 4.373e+02 7.155e+02 1.857e+03, threshold=8.745e+02, percent-clipped=14.0 2023-06-25 14:58:48,039 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1358172.0, ans=0.125 2023-06-25 14:58:52,261 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=13.94 vs. limit=22.5 2023-06-25 14:58:54,202 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.62 vs. limit=12.0 2023-06-25 14:59:15,247 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=1358292.0, ans=0.125 2023-06-25 15:00:24,723 INFO [train.py:996] (3/4) Epoch 8, batch 12950, loss[loss=0.2311, simple_loss=0.3084, pruned_loss=0.07693, over 21700.00 frames. ], tot_loss[loss=0.2208, simple_loss=0.3031, pruned_loss=0.0692, over 4276586.03 frames. ], batch size: 298, lr: 3.75e-03, grad_scale: 16.0 2023-06-25 15:00:52,427 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.min_positive, batch_count=1358532.0, ans=0.025 2023-06-25 15:02:14,885 INFO [train.py:996] (3/4) Epoch 8, batch 13000, loss[loss=0.2178, simple_loss=0.3024, pruned_loss=0.06658, over 21699.00 frames. ], tot_loss[loss=0.2217, simple_loss=0.3039, pruned_loss=0.06978, over 4280633.10 frames. ], batch size: 415, lr: 3.75e-03, grad_scale: 16.0 2023-06-25 15:02:18,739 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=1358772.0, ans=0.1 2023-06-25 15:02:23,103 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.358e+02 3.843e+02 4.886e+02 6.754e+02 1.173e+03, threshold=9.772e+02, percent-clipped=9.0 2023-06-25 15:02:27,566 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.32 vs. limit=10.0 2023-06-25 15:02:39,544 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=2.81 vs. limit=15.0 2023-06-25 15:02:46,280 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.min_positive, batch_count=1358832.0, ans=0.05 2023-06-25 15:03:21,230 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1358952.0, ans=0.125 2023-06-25 15:03:45,775 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1359012.0, ans=0.125 2023-06-25 15:03:57,487 INFO [train.py:996] (3/4) Epoch 8, batch 13050, loss[loss=0.2004, simple_loss=0.2782, pruned_loss=0.06128, over 21666.00 frames. ], tot_loss[loss=0.2167, simple_loss=0.298, pruned_loss=0.06772, over 4284951.15 frames. 
], batch size: 263, lr: 3.75e-03, grad_scale: 16.0 2023-06-25 15:04:55,098 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1359192.0, ans=0.125 2023-06-25 15:05:02,069 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1359252.0, ans=0.125 2023-06-25 15:05:30,680 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1359312.0, ans=0.125 2023-06-25 15:05:34,717 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-25 15:05:41,251 INFO [train.py:996] (3/4) Epoch 8, batch 13100, loss[loss=0.2221, simple_loss=0.3013, pruned_loss=0.07139, over 21763.00 frames. ], tot_loss[loss=0.2156, simple_loss=0.2962, pruned_loss=0.06747, over 4281631.78 frames. ], batch size: 247, lr: 3.75e-03, grad_scale: 16.0 2023-06-25 15:05:44,180 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.60 vs. limit=15.0 2023-06-25 15:05:50,178 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.856e+02 3.427e+02 4.465e+02 6.179e+02 1.477e+03, threshold=8.931e+02, percent-clipped=2.0 2023-06-25 15:05:50,806 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1359372.0, ans=0.0 2023-06-25 15:06:21,950 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=1359432.0, ans=0.125 2023-06-25 15:06:52,995 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=1359492.0, ans=0.125 2023-06-25 15:07:19,694 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 15:07:21,805 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=1359612.0, ans=0.0 2023-06-25 15:07:31,677 INFO [train.py:996] (3/4) Epoch 8, batch 13150, loss[loss=0.2634, simple_loss=0.329, pruned_loss=0.09887, over 21377.00 frames. ], tot_loss[loss=0.218, simple_loss=0.2976, pruned_loss=0.06923, over 4282758.08 frames. ], batch size: 548, lr: 3.75e-03, grad_scale: 16.0 2023-06-25 15:09:27,371 INFO [train.py:996] (3/4) Epoch 8, batch 13200, loss[loss=0.2431, simple_loss=0.3201, pruned_loss=0.08307, over 21212.00 frames. ], tot_loss[loss=0.2189, simple_loss=0.2975, pruned_loss=0.07015, over 4288451.81 frames. 
], batch size: 143, lr: 3.75e-03, grad_scale: 32.0 2023-06-25 15:09:27,905 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.min_positive, batch_count=1359972.0, ans=0.025 2023-06-25 15:09:46,143 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.744e+02 3.706e+02 4.388e+02 6.661e+02 1.084e+03, threshold=8.775e+02, percent-clipped=9.0 2023-06-25 15:10:00,640 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=1360032.0, ans=0.125 2023-06-25 15:10:10,971 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1360092.0, ans=0.0 2023-06-25 15:10:51,603 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-25 15:11:04,377 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1360212.0, ans=0.1 2023-06-25 15:11:21,329 INFO [train.py:996] (3/4) Epoch 8, batch 13250, loss[loss=0.2126, simple_loss=0.2888, pruned_loss=0.06817, over 21412.00 frames. ], tot_loss[loss=0.2204, simple_loss=0.2974, pruned_loss=0.07169, over 4292931.39 frames. ], batch size: 194, lr: 3.75e-03, grad_scale: 16.0 2023-06-25 15:12:14,746 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.30 vs. limit=12.0 2023-06-25 15:12:37,435 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1360452.0, ans=0.125 2023-06-25 15:13:18,520 INFO [train.py:996] (3/4) Epoch 8, batch 13300, loss[loss=0.243, simple_loss=0.3177, pruned_loss=0.08415, over 21316.00 frames. ], tot_loss[loss=0.2219, simple_loss=0.3009, pruned_loss=0.0714, over 4291222.46 frames. ], batch size: 176, lr: 3.75e-03, grad_scale: 16.0 2023-06-25 15:13:34,394 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.430e+02 3.717e+02 5.105e+02 6.593e+02 1.654e+03, threshold=1.021e+03, percent-clipped=11.0 2023-06-25 15:15:02,606 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=11.16 vs. limit=15.0 2023-06-25 15:15:05,295 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=1360812.0, ans=0.0 2023-06-25 15:15:08,200 INFO [train.py:996] (3/4) Epoch 8, batch 13350, loss[loss=0.2152, simple_loss=0.3009, pruned_loss=0.06475, over 20674.00 frames. ], tot_loss[loss=0.2266, simple_loss=0.3053, pruned_loss=0.07393, over 4287374.42 frames. ], batch size: 607, lr: 3.74e-03, grad_scale: 16.0 2023-06-25 15:16:11,497 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1361052.0, ans=0.125 2023-06-25 15:16:51,178 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1361112.0, ans=0.125 2023-06-25 15:17:03,310 INFO [train.py:996] (3/4) Epoch 8, batch 13400, loss[loss=0.2879, simple_loss=0.344, pruned_loss=0.1159, over 21458.00 frames. ], tot_loss[loss=0.2292, simple_loss=0.3071, pruned_loss=0.07564, over 4293735.17 frames. 
], batch size: 507, lr: 3.74e-03, grad_scale: 16.0 2023-06-25 15:17:13,984 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.122e+02 3.939e+02 4.986e+02 7.057e+02 1.760e+03, threshold=9.973e+02, percent-clipped=5.0 2023-06-25 15:17:54,505 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1361292.0, ans=0.0 2023-06-25 15:18:05,140 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.92 vs. limit=15.0 2023-06-25 15:18:25,923 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=8.61 vs. limit=10.0 2023-06-25 15:18:51,623 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-25 15:18:52,665 INFO [train.py:996] (3/4) Epoch 8, batch 13450, loss[loss=0.2871, simple_loss=0.3399, pruned_loss=0.1172, over 21451.00 frames. ], tot_loss[loss=0.2317, simple_loss=0.3081, pruned_loss=0.07762, over 4291490.73 frames. ], batch size: 509, lr: 3.74e-03, grad_scale: 16.0 2023-06-25 15:19:23,383 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=1361532.0, ans=0.0 2023-06-25 15:19:54,567 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1361592.0, ans=0.125 2023-06-25 15:20:02,405 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.45 vs. limit=22.5 2023-06-25 15:20:12,481 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=1361652.0, ans=0.05 2023-06-25 15:20:25,574 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=1361712.0, ans=0.2 2023-06-25 15:20:42,821 INFO [train.py:996] (3/4) Epoch 8, batch 13500, loss[loss=0.2105, simple_loss=0.2848, pruned_loss=0.06807, over 21706.00 frames. ], tot_loss[loss=0.2252, simple_loss=0.3005, pruned_loss=0.07499, over 4285477.24 frames. ], batch size: 247, lr: 3.74e-03, grad_scale: 16.0 2023-06-25 15:20:53,701 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.704e+02 3.900e+02 4.940e+02 7.289e+02 1.559e+03, threshold=9.879e+02, percent-clipped=7.0 2023-06-25 15:20:54,555 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1361772.0, ans=0.125 2023-06-25 15:20:57,960 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=1361772.0, ans=0.0 2023-06-25 15:21:05,390 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=1361832.0, ans=0.125 2023-06-25 15:21:57,541 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=1361952.0, ans=0.2 2023-06-25 15:22:34,510 INFO [train.py:996] (3/4) Epoch 8, batch 13550, loss[loss=0.2177, simple_loss=0.2991, pruned_loss=0.06817, over 21793.00 frames. ], tot_loss[loss=0.2266, simple_loss=0.3052, pruned_loss=0.07397, over 4288850.04 frames. 
], batch size: 124, lr: 3.74e-03, grad_scale: 16.0 2023-06-25 15:22:43,799 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1362072.0, ans=0.0 2023-06-25 15:23:03,105 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1362132.0, ans=0.125 2023-06-25 15:23:36,707 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=7.90 vs. limit=15.0 2023-06-25 15:23:59,793 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.68 vs. limit=12.0 2023-06-25 15:24:02,810 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1362312.0, ans=0.0 2023-06-25 15:24:18,226 INFO [train.py:996] (3/4) Epoch 8, batch 13600, loss[loss=0.2158, simple_loss=0.2879, pruned_loss=0.07183, over 21525.00 frames. ], tot_loss[loss=0.2273, simple_loss=0.3058, pruned_loss=0.07443, over 4290727.38 frames. ], batch size: 211, lr: 3.74e-03, grad_scale: 32.0 2023-06-25 15:24:28,503 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.890e+02 3.859e+02 5.232e+02 7.287e+02 1.567e+03, threshold=1.046e+03, percent-clipped=12.0 2023-06-25 15:24:38,187 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1362432.0, ans=0.0 2023-06-25 15:25:10,025 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.24 vs. limit=15.0 2023-06-25 15:25:15,216 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=1362492.0, ans=0.0 2023-06-25 15:25:25,957 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1362552.0, ans=0.1 2023-06-25 15:25:39,972 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.20 vs. limit=22.5 2023-06-25 15:26:01,148 INFO [train.py:996] (3/4) Epoch 8, batch 13650, loss[loss=0.1966, simple_loss=0.2645, pruned_loss=0.06438, over 21848.00 frames. ], tot_loss[loss=0.2204, simple_loss=0.2989, pruned_loss=0.07098, over 4284578.94 frames. ], batch size: 118, lr: 3.74e-03, grad_scale: 16.0 2023-06-25 15:26:03,378 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1362672.0, ans=0.125 2023-06-25 15:27:45,828 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-25 15:27:50,082 INFO [train.py:996] (3/4) Epoch 8, batch 13700, loss[loss=0.2473, simple_loss=0.3277, pruned_loss=0.08346, over 21654.00 frames. ], tot_loss[loss=0.2173, simple_loss=0.2936, pruned_loss=0.07048, over 4273577.88 frames. ], batch size: 414, lr: 3.74e-03, grad_scale: 16.0 2023-06-25 15:27:56,196 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.75 vs. 
limit=15.0 2023-06-25 15:27:58,013 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 15:28:08,788 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.733e+02 3.641e+02 4.705e+02 7.070e+02 1.116e+03, threshold=9.410e+02, percent-clipped=4.0 2023-06-25 15:28:31,681 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=1363032.0, ans=0.125 2023-06-25 15:28:40,687 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1363092.0, ans=0.1 2023-06-25 15:28:48,995 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1363092.0, ans=0.0 2023-06-25 15:29:46,462 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=8.77 vs. limit=15.0 2023-06-25 15:29:46,967 INFO [train.py:996] (3/4) Epoch 8, batch 13750, loss[loss=0.2894, simple_loss=0.3575, pruned_loss=0.1107, over 21499.00 frames. ], tot_loss[loss=0.2158, simple_loss=0.2916, pruned_loss=0.07004, over 4263914.38 frames. ], batch size: 508, lr: 3.74e-03, grad_scale: 16.0 2023-06-25 15:30:10,146 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1363332.0, ans=0.0 2023-06-25 15:31:42,858 INFO [train.py:996] (3/4) Epoch 8, batch 13800, loss[loss=0.3333, simple_loss=0.4197, pruned_loss=0.1234, over 21492.00 frames. ], tot_loss[loss=0.2176, simple_loss=0.2957, pruned_loss=0.06973, over 4257627.55 frames. ], batch size: 507, lr: 3.74e-03, grad_scale: 16.0 2023-06-25 15:32:00,798 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.906e+02 4.517e+02 6.756e+02 9.995e+02 2.111e+03, threshold=1.351e+03, percent-clipped=26.0 2023-06-25 15:32:19,504 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1363632.0, ans=0.125 2023-06-25 15:33:24,937 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1363812.0, ans=0.1 2023-06-25 15:33:33,362 INFO [train.py:996] (3/4) Epoch 8, batch 13850, loss[loss=0.2745, simple_loss=0.3579, pruned_loss=0.09552, over 21272.00 frames. ], tot_loss[loss=0.2228, simple_loss=0.3036, pruned_loss=0.07099, over 4261357.44 frames. ], batch size: 548, lr: 3.74e-03, grad_scale: 16.0 2023-06-25 15:33:35,888 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1363872.0, ans=0.125 2023-06-25 15:33:49,959 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=1363932.0, ans=0.04949747468305833 2023-06-25 15:33:56,621 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1363932.0, ans=0.0 2023-06-25 15:34:00,498 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1363932.0, ans=0.125 2023-06-25 15:35:20,916 INFO [train.py:996] (3/4) Epoch 8, batch 13900, loss[loss=0.2143, simple_loss=0.2765, pruned_loss=0.07605, over 21129.00 frames. ], tot_loss[loss=0.2277, simple_loss=0.3075, pruned_loss=0.07397, over 4266915.98 frames. 
], batch size: 607, lr: 3.74e-03, grad_scale: 16.0 2023-06-25 15:35:33,171 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.067e+02 4.054e+02 4.959e+02 6.399e+02 1.364e+03, threshold=9.918e+02, percent-clipped=1.0 2023-06-25 15:35:37,513 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1364232.0, ans=0.125 2023-06-25 15:35:47,957 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1364232.0, ans=0.0 2023-06-25 15:36:43,560 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=1364352.0, ans=0.2 2023-06-25 15:37:09,428 INFO [train.py:996] (3/4) Epoch 8, batch 13950, loss[loss=0.1965, simple_loss=0.2615, pruned_loss=0.06574, over 20829.00 frames. ], tot_loss[loss=0.2293, simple_loss=0.3075, pruned_loss=0.0755, over 4273162.23 frames. ], batch size: 608, lr: 3.74e-03, grad_scale: 16.0 2023-06-25 15:37:17,335 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=1364472.0, ans=0.125 2023-06-25 15:38:02,960 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.34 vs. limit=15.0 2023-06-25 15:38:17,352 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=1364652.0, ans=0.2 2023-06-25 15:38:57,950 INFO [train.py:996] (3/4) Epoch 8, batch 14000, loss[loss=0.2137, simple_loss=0.3173, pruned_loss=0.05508, over 21706.00 frames. ], tot_loss[loss=0.2259, simple_loss=0.3047, pruned_loss=0.07357, over 4282131.90 frames. 
], batch size: 389, lr: 3.74e-03, grad_scale: 32.0 2023-06-25 15:39:03,886 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1364772.0, ans=0.125 2023-06-25 15:39:03,904 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=1364772.0, ans=0.0 2023-06-25 15:39:09,927 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.689e+02 3.751e+02 4.894e+02 7.186e+02 1.368e+03, threshold=9.787e+02, percent-clipped=13.0 2023-06-25 15:39:22,549 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1364832.0, ans=0.125 2023-06-25 15:39:26,164 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1364832.0, ans=0.1 2023-06-25 15:39:32,831 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1364892.0, ans=0.0 2023-06-25 15:39:50,540 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-25 15:39:59,225 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1364952.0, ans=0.125 2023-06-25 15:40:15,446 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=1364952.0, ans=0.2 2023-06-25 15:40:40,781 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1365012.0, ans=0.0 2023-06-25 15:40:45,638 INFO [train.py:996] (3/4) Epoch 8, batch 14050, loss[loss=0.2289, simple_loss=0.3106, pruned_loss=0.07358, over 21554.00 frames. ], tot_loss[loss=0.2198, simple_loss=0.2998, pruned_loss=0.06991, over 4279963.88 frames. ], batch size: 471, lr: 3.74e-03, grad_scale: 32.0 2023-06-25 15:41:36,939 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=1365192.0, ans=0.125 2023-06-25 15:42:04,023 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1365252.0, ans=0.125 2023-06-25 15:42:27,365 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1365312.0, ans=0.125 2023-06-25 15:42:33,614 INFO [train.py:996] (3/4) Epoch 8, batch 14100, loss[loss=0.2161, simple_loss=0.3343, pruned_loss=0.04893, over 20755.00 frames. ], tot_loss[loss=0.2167, simple_loss=0.2941, pruned_loss=0.06961, over 4275722.26 frames. ], batch size: 607, lr: 3.74e-03, grad_scale: 16.0 2023-06-25 15:42:47,586 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.656e+02 3.476e+02 4.443e+02 5.620e+02 1.211e+03, threshold=8.886e+02, percent-clipped=2.0 2023-06-25 15:43:01,967 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1365432.0, ans=0.0 2023-06-25 15:43:56,981 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1365612.0, ans=0.125 2023-06-25 15:44:19,902 INFO [train.py:996] (3/4) Epoch 8, batch 14150, loss[loss=0.2015, simple_loss=0.2864, pruned_loss=0.05827, over 21183.00 frames. ], tot_loss[loss=0.2188, simple_loss=0.2971, pruned_loss=0.07021, over 4263126.02 frames. 
], batch size: 176, lr: 3.74e-03, grad_scale: 16.0 2023-06-25 15:44:39,308 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=1365732.0, ans=0.0 2023-06-25 15:45:22,418 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1365852.0, ans=0.125 2023-06-25 15:46:01,155 INFO [train.py:996] (3/4) Epoch 8, batch 14200, loss[loss=0.1879, simple_loss=0.2618, pruned_loss=0.057, over 21582.00 frames. ], tot_loss[loss=0.2176, simple_loss=0.2971, pruned_loss=0.06908, over 4258558.32 frames. ], batch size: 230, lr: 3.74e-03, grad_scale: 16.0 2023-06-25 15:46:20,190 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.681e+02 4.879e+02 7.691e+02 1.070e+03 2.190e+03, threshold=1.538e+03, percent-clipped=38.0 2023-06-25 15:46:22,614 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=1366032.0, ans=0.125 2023-06-25 15:46:33,302 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=13.43 vs. limit=15.0 2023-06-25 15:46:38,261 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=1366032.0, ans=0.2 2023-06-25 15:46:44,969 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1366092.0, ans=0.0 2023-06-25 15:47:01,309 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.53 vs. limit=12.0 2023-06-25 15:47:10,601 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=1366152.0, ans=0.95 2023-06-25 15:47:28,276 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.min_abs, batch_count=1366152.0, ans=0.5 2023-06-25 15:47:42,305 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1366212.0, ans=0.0 2023-06-25 15:47:49,378 INFO [train.py:996] (3/4) Epoch 8, batch 14250, loss[loss=0.2191, simple_loss=0.2953, pruned_loss=0.07141, over 21420.00 frames. ], tot_loss[loss=0.2154, simple_loss=0.2923, pruned_loss=0.06922, over 4268563.53 frames. ], batch size: 508, lr: 3.74e-03, grad_scale: 16.0 2023-06-25 15:48:13,008 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=1366332.0, ans=0.0 2023-06-25 15:48:27,257 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1366332.0, ans=0.125 2023-06-25 15:48:27,718 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.13 vs. 
limit=12.0 2023-06-25 15:48:30,823 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=1366392.0, ans=0.0 2023-06-25 15:49:01,753 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1366452.0, ans=0.125 2023-06-25 15:49:15,313 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=1366452.0, ans=0.125 2023-06-25 15:49:16,950 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1366452.0, ans=0.125 2023-06-25 15:49:31,148 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1366512.0, ans=0.125 2023-06-25 15:49:33,902 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.34 vs. limit=15.0 2023-06-25 15:49:39,565 INFO [train.py:996] (3/4) Epoch 8, batch 14300, loss[loss=0.1855, simple_loss=0.2615, pruned_loss=0.05477, over 21186.00 frames. ], tot_loss[loss=0.2151, simple_loss=0.2931, pruned_loss=0.06859, over 4266646.83 frames. ], batch size: 176, lr: 3.74e-03, grad_scale: 16.0 2023-06-25 15:49:59,644 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.539e+02 3.382e+02 4.720e+02 7.552e+02 1.673e+03, threshold=9.439e+02, percent-clipped=2.0 2023-06-25 15:51:11,101 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1366812.0, ans=0.125 2023-06-25 15:51:15,209 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 15:51:23,203 INFO [train.py:996] (3/4) Epoch 8, batch 14350, loss[loss=0.2399, simple_loss=0.3451, pruned_loss=0.06737, over 19706.00 frames. ], tot_loss[loss=0.2169, simple_loss=0.2968, pruned_loss=0.06851, over 4262381.27 frames. ], batch size: 703, lr: 3.74e-03, grad_scale: 16.0 2023-06-25 15:51:40,282 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=15.63 vs. limit=22.5 2023-06-25 15:51:41,784 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.74 vs. limit=15.0 2023-06-25 15:52:04,905 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=1366992.0, ans=0.2 2023-06-25 15:52:17,652 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys.whitening_limit, batch_count=1366992.0, ans=6.0 2023-06-25 15:53:02,389 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1367112.0, ans=0.1 2023-06-25 15:53:17,500 INFO [train.py:996] (3/4) Epoch 8, batch 14400, loss[loss=0.2113, simple_loss=0.2738, pruned_loss=0.07437, over 21831.00 frames. ], tot_loss[loss=0.2162, simple_loss=0.2949, pruned_loss=0.0687, over 4258809.56 frames. 
], batch size: 98, lr: 3.74e-03, grad_scale: 32.0 2023-06-25 15:53:30,786 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.862e+02 3.850e+02 4.891e+02 6.324e+02 1.594e+03, threshold=9.783e+02, percent-clipped=6.0 2023-06-25 15:53:46,077 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=6.36 vs. limit=15.0 2023-06-25 15:53:46,212 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=6.63 vs. limit=15.0 2023-06-25 15:54:37,500 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1367412.0, ans=0.125 2023-06-25 15:54:53,580 INFO [train.py:996] (3/4) Epoch 8, batch 14450, loss[loss=0.1865, simple_loss=0.2542, pruned_loss=0.05937, over 21490.00 frames. ], tot_loss[loss=0.2132, simple_loss=0.2894, pruned_loss=0.06846, over 4256210.10 frames. ], batch size: 212, lr: 3.74e-03, grad_scale: 32.0 2023-06-25 15:56:22,454 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1367712.0, ans=0.125 2023-06-25 15:56:32,881 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1367712.0, ans=0.125 2023-06-25 15:56:34,785 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1367712.0, ans=0.125 2023-06-25 15:56:39,855 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.07 vs. limit=15.0 2023-06-25 15:56:40,131 INFO [train.py:996] (3/4) Epoch 8, batch 14500, loss[loss=0.2103, simple_loss=0.2707, pruned_loss=0.07495, over 21662.00 frames. ], tot_loss[loss=0.2115, simple_loss=0.2858, pruned_loss=0.06865, over 4255075.07 frames. ], batch size: 416, lr: 3.74e-03, grad_scale: 16.0 2023-06-25 15:56:57,740 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=1367772.0, ans=0.2 2023-06-25 15:57:02,148 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.716e+02 3.452e+02 4.183e+02 6.174e+02 1.088e+03, threshold=8.366e+02, percent-clipped=1.0 2023-06-25 15:57:20,136 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1367892.0, ans=0.125 2023-06-25 15:58:04,393 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=1367952.0, ans=0.125 2023-06-25 15:58:33,913 INFO [train.py:996] (3/4) Epoch 8, batch 14550, loss[loss=0.2154, simple_loss=0.2707, pruned_loss=0.08005, over 20051.00 frames. ], tot_loss[loss=0.2152, simple_loss=0.2897, pruned_loss=0.07033, over 4260221.49 frames. ], batch size: 703, lr: 3.73e-03, grad_scale: 16.0 2023-06-25 15:58:36,591 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.89 vs. limit=22.5 2023-06-25 15:58:38,441 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.91 vs. 
limit=12.0 2023-06-25 15:59:11,412 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=1368132.0, ans=0.0 2023-06-25 15:59:22,176 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1368192.0, ans=0.125 2023-06-25 15:59:30,709 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=1368192.0, ans=0.2 2023-06-25 15:59:58,322 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=4.39 vs. limit=15.0 2023-06-25 16:00:15,158 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1368312.0, ans=0.125 2023-06-25 16:00:22,819 INFO [train.py:996] (3/4) Epoch 8, batch 14600, loss[loss=0.2334, simple_loss=0.3238, pruned_loss=0.07149, over 21870.00 frames. ], tot_loss[loss=0.2231, simple_loss=0.2981, pruned_loss=0.07408, over 4268103.87 frames. ], batch size: 371, lr: 3.73e-03, grad_scale: 16.0 2023-06-25 16:00:38,178 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.261e+02 4.717e+02 6.049e+02 8.556e+02 1.756e+03, threshold=1.210e+03, percent-clipped=27.0 2023-06-25 16:00:51,121 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=1368432.0, ans=0.04949747468305833 2023-06-25 16:00:54,914 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.82 vs. limit=22.5 2023-06-25 16:01:09,911 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1368492.0, ans=0.125 2023-06-25 16:01:27,260 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-25 16:01:37,625 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=1368552.0, ans=0.2 2023-06-25 16:01:41,655 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.01 vs. limit=6.0 2023-06-25 16:02:10,907 INFO [train.py:996] (3/4) Epoch 8, batch 14650, loss[loss=0.163, simple_loss=0.2522, pruned_loss=0.03686, over 21612.00 frames. ], tot_loss[loss=0.2224, simple_loss=0.2994, pruned_loss=0.07273, over 4275282.00 frames. ], batch size: 263, lr: 3.73e-03, grad_scale: 16.0 2023-06-25 16:02:55,109 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1368792.0, ans=0.125 2023-06-25 16:03:13,591 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=1368792.0, ans=0.2 2023-06-25 16:03:42,081 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.02 vs. limit=15.0 2023-06-25 16:03:42,929 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1368912.0, ans=0.125 2023-06-25 16:03:58,466 INFO [train.py:996] (3/4) Epoch 8, batch 14700, loss[loss=0.1674, simple_loss=0.2445, pruned_loss=0.04516, over 21419.00 frames. 
], tot_loss[loss=0.2148, simple_loss=0.2947, pruned_loss=0.06743, over 4266861.11 frames. ], batch size: 194, lr: 3.73e-03, grad_scale: 16.0 2023-06-25 16:04:14,434 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.383e+02 3.677e+02 4.958e+02 7.109e+02 1.155e+03, threshold=9.917e+02, percent-clipped=0.0 2023-06-25 16:04:21,074 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.45 vs. limit=15.0 2023-06-25 16:05:43,681 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.min_positive, batch_count=1369212.0, ans=0.05 2023-06-25 16:05:50,270 INFO [train.py:996] (3/4) Epoch 8, batch 14750, loss[loss=0.2747, simple_loss=0.348, pruned_loss=0.1007, over 21207.00 frames. ], tot_loss[loss=0.2193, simple_loss=0.2987, pruned_loss=0.06992, over 4262838.41 frames. ], batch size: 143, lr: 3.73e-03, grad_scale: 16.0 2023-06-25 16:06:13,217 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=1369332.0, ans=0.125 2023-06-25 16:06:28,120 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=1369332.0, ans=0.0 2023-06-25 16:06:28,137 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1369332.0, ans=0.0 2023-06-25 16:06:31,588 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1369332.0, ans=0.125 2023-06-25 16:07:16,650 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=1369452.0, ans=0.0 2023-06-25 16:07:18,976 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=6.52 vs. limit=15.0 2023-06-25 16:07:35,508 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=1369512.0, ans=0.035 2023-06-25 16:07:47,162 INFO [train.py:996] (3/4) Epoch 8, batch 14800, loss[loss=0.2337, simple_loss=0.3087, pruned_loss=0.0793, over 19999.00 frames. ], tot_loss[loss=0.2321, simple_loss=0.3121, pruned_loss=0.07601, over 4267454.63 frames. ], batch size: 702, lr: 3.73e-03, grad_scale: 32.0 2023-06-25 16:08:12,802 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.189e+02 4.811e+02 6.847e+02 1.023e+03 2.171e+03, threshold=1.369e+03, percent-clipped=26.0 2023-06-25 16:08:31,591 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1369692.0, ans=0.125 2023-06-25 16:08:53,001 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=1369692.0, ans=0.2 2023-06-25 16:09:43,105 INFO [train.py:996] (3/4) Epoch 8, batch 14850, loss[loss=0.2031, simple_loss=0.2718, pruned_loss=0.06717, over 21527.00 frames. ], tot_loss[loss=0.2289, simple_loss=0.3062, pruned_loss=0.07582, over 4265247.68 frames. 
], batch size: 230, lr: 3.73e-03, grad_scale: 32.0 2023-06-25 16:10:30,672 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1369992.0, ans=0.125 2023-06-25 16:11:24,915 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.76 vs. limit=22.5 2023-06-25 16:11:39,838 INFO [train.py:996] (3/4) Epoch 8, batch 14900, loss[loss=0.222, simple_loss=0.2933, pruned_loss=0.07538, over 21484.00 frames. ], tot_loss[loss=0.2316, simple_loss=0.3082, pruned_loss=0.07746, over 4264885.88 frames. ], batch size: 194, lr: 3.73e-03, grad_scale: 16.0 2023-06-25 16:11:40,365 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1370172.0, ans=0.125 2023-06-25 16:11:45,495 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1370172.0, ans=0.1 2023-06-25 16:11:57,469 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.090e+02 4.206e+02 5.469e+02 8.347e+02 1.577e+03, threshold=1.094e+03, percent-clipped=2.0 2023-06-25 16:12:19,326 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1370292.0, ans=0.0 2023-06-25 16:12:27,025 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1370292.0, ans=0.0 2023-06-25 16:13:30,835 INFO [train.py:996] (3/4) Epoch 8, batch 14950, loss[loss=0.2073, simple_loss=0.2697, pruned_loss=0.07242, over 20064.00 frames. ], tot_loss[loss=0.2306, simple_loss=0.3081, pruned_loss=0.07651, over 4268093.48 frames. ], batch size: 702, lr: 3.73e-03, grad_scale: 16.0 2023-06-25 16:13:37,082 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer_na.min_abs, batch_count=1370472.0, ans=0.02 2023-06-25 16:14:18,027 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1370592.0, ans=0.125 2023-06-25 16:15:12,029 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=1370712.0, ans=0.125 2023-06-25 16:15:19,944 INFO [train.py:996] (3/4) Epoch 8, batch 15000, loss[loss=0.2344, simple_loss=0.3033, pruned_loss=0.08273, over 21280.00 frames. ], tot_loss[loss=0.2334, simple_loss=0.3109, pruned_loss=0.07797, over 4268017.45 frames. ], batch size: 159, lr: 3.73e-03, grad_scale: 16.0 2023-06-25 16:15:19,945 INFO [train.py:1019] (3/4) Computing validation loss 2023-06-25 16:15:40,721 INFO [train.py:1028] (3/4) Epoch 8, validation: loss=0.2554, simple_loss=0.3473, pruned_loss=0.08173, over 1796401.00 frames. 
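Two of the recurring entry types in this log can be read as follows: the periodic "validation: loss=..., over N frames." lines report a frame-weighted average loss over the dev set and are followed by the peak GPU memory seen so far, while the "Clipping_scale=2.0, grad-norm quartiles ... threshold=..." lines report order statistics of recently observed gradient norms, with the threshold equal to Clipping_scale times the median quartile (e.g. 2.0 x 4.977e+02 = 9.953e+02 in the entry that follows). The sketch below is a minimal reconstruction under those assumptions, not the actual train.py/optim.py code; run_validation, compute_loss, valid_dl and recent_grad_norms are hypothetical names introduced only for illustration.

# Illustrative sketch only -- not the icefall implementation.
import logging
import torch

def run_validation(model, valid_dl, compute_loss, device):
    """Produce entries like 'validation: loss=..., over N frames.' followed by
    the 'Maximum memory allocated so far is ...MB' line."""
    model.eval()
    tot_loss, tot_frames = 0.0, 0.0
    with torch.no_grad():
        for batch in valid_dl:
            # compute_loss is assumed to return the loss summed over frames
            # plus the number of frames in the batch.
            loss, num_frames = compute_loss(model, batch)
            tot_loss += loss.item()
            tot_frames += num_frames
    model.train()
    logging.info(
        f"validation: loss={tot_loss / tot_frames:.4f}, over {tot_frames:.2f} frames."
    )
    max_mem_mb = torch.cuda.max_memory_allocated(device) // (1024 * 1024)
    logging.info(f"Maximum memory allocated so far is {max_mem_mb}MB")

def clipping_threshold(recent_grad_norms, clipping_scale=2.0):
    """One plausible reading of the 'grad-norm quartiles ... threshold=...'
    entries: quartiles of recent gradient norms, with the clipping threshold
    set to clipping_scale times the median."""
    q = torch.quantile(
        torch.tensor(recent_grad_norms, dtype=torch.float32),
        torch.tensor([0.0, 0.25, 0.5, 0.75, 1.0]),
    )
    return q, clipping_scale * q[2]

The training-side "tot_loss[...] over N frames" entries can be read the same way as the validation line: a frame-weighted average of the pruned-transducer loss accumulated over the recently seen batches.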
2023-06-25 16:15:40,722 INFO [train.py:1029] (3/4) Maximum memory allocated so far is 23654MB 2023-06-25 16:15:58,825 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.944e+02 3.850e+02 4.977e+02 6.696e+02 1.113e+03, threshold=9.953e+02, percent-clipped=2.0 2023-06-25 16:16:03,035 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 16:16:03,076 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=1370832.0, ans=0.125 2023-06-25 16:16:36,831 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer_ff2.min_abs, batch_count=1370892.0, ans=0.1 2023-06-25 16:16:59,627 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=9.26 vs. limit=15.0 2023-06-25 16:17:30,923 INFO [train.py:996] (3/4) Epoch 8, batch 15050, loss[loss=0.2188, simple_loss=0.297, pruned_loss=0.07033, over 21615.00 frames. ], tot_loss[loss=0.2335, simple_loss=0.3117, pruned_loss=0.07763, over 4258110.08 frames. ], batch size: 230, lr: 3.73e-03, grad_scale: 16.0 2023-06-25 16:17:51,330 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.98 vs. limit=15.0 2023-06-25 16:17:52,474 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=1371132.0, ans=0.2 2023-06-25 16:17:54,029 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1371132.0, ans=0.125 2023-06-25 16:18:52,329 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1371252.0, ans=0.125 2023-06-25 16:18:56,166 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1371312.0, ans=0.1 2023-06-25 16:19:20,745 INFO [train.py:996] (3/4) Epoch 8, batch 15100, loss[loss=0.2684, simple_loss=0.3436, pruned_loss=0.09663, over 21568.00 frames. ], tot_loss[loss=0.2354, simple_loss=0.3143, pruned_loss=0.07823, over 4257764.83 frames. ], batch size: 414, lr: 3.73e-03, grad_scale: 16.0 2023-06-25 16:19:30,361 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1371372.0, ans=0.2 2023-06-25 16:19:43,658 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.981e+02 4.480e+02 6.447e+02 8.808e+02 1.442e+03, threshold=1.289e+03, percent-clipped=16.0 2023-06-25 16:20:10,223 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1371492.0, ans=0.0 2023-06-25 16:20:51,224 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1371612.0, ans=0.0 2023-06-25 16:20:56,934 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=7.25 vs. limit=15.0 2023-06-25 16:21:09,587 INFO [train.py:996] (3/4) Epoch 8, batch 15150, loss[loss=0.1903, simple_loss=0.2615, pruned_loss=0.05954, over 21994.00 frames. ], tot_loss[loss=0.2319, simple_loss=0.309, pruned_loss=0.07742, over 4260748.01 frames. 
], batch size: 103, lr: 3.73e-03, grad_scale: 16.0 2023-06-25 16:22:08,017 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1371792.0, ans=0.1 2023-06-25 16:22:39,410 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1371912.0, ans=0.125 2023-06-25 16:22:57,843 INFO [train.py:996] (3/4) Epoch 8, batch 15200, loss[loss=0.1922, simple_loss=0.2758, pruned_loss=0.05433, over 21722.00 frames. ], tot_loss[loss=0.2253, simple_loss=0.3019, pruned_loss=0.07439, over 4255180.60 frames. ], batch size: 282, lr: 3.73e-03, grad_scale: 32.0 2023-06-25 16:23:26,361 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.586e+02 3.888e+02 5.742e+02 8.749e+02 1.820e+03, threshold=1.148e+03, percent-clipped=6.0 2023-06-25 16:23:38,099 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.38 vs. limit=15.0 2023-06-25 16:23:47,757 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=1372092.0, ans=0.1 2023-06-25 16:24:52,720 INFO [train.py:996] (3/4) Epoch 8, batch 15250, loss[loss=0.1924, simple_loss=0.2591, pruned_loss=0.06285, over 21371.00 frames. ], tot_loss[loss=0.2218, simple_loss=0.296, pruned_loss=0.0738, over 4264957.56 frames. ], batch size: 131, lr: 3.73e-03, grad_scale: 16.0 2023-06-25 16:24:53,363 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1372272.0, ans=0.0 2023-06-25 16:25:40,705 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.09 vs. limit=12.0 2023-06-25 16:26:23,083 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=1372512.0, ans=0.125 2023-06-25 16:26:48,268 INFO [train.py:996] (3/4) Epoch 8, batch 15300, loss[loss=0.2349, simple_loss=0.3104, pruned_loss=0.07965, over 21282.00 frames. ], tot_loss[loss=0.2252, simple_loss=0.2993, pruned_loss=0.07555, over 4258629.69 frames. ], batch size: 176, lr: 3.73e-03, grad_scale: 16.0 2023-06-25 16:26:50,574 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=1372572.0, ans=0.2 2023-06-25 16:27:04,812 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=1372572.0, ans=0.2 2023-06-25 16:27:05,414 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.78 vs. 
limit=10.0 2023-06-25 16:27:12,645 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.814e+02 3.962e+02 5.141e+02 6.603e+02 1.300e+03, threshold=1.028e+03, percent-clipped=5.0 2023-06-25 16:27:29,429 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=1372692.0, ans=0.125 2023-06-25 16:27:29,478 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1372692.0, ans=0.125 2023-06-25 16:27:29,541 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1372692.0, ans=0.125 2023-06-25 16:27:35,050 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.39 vs. limit=15.0 2023-06-25 16:28:18,490 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.69 vs. limit=15.0 2023-06-25 16:28:30,829 INFO [train.py:996] (3/4) Epoch 8, batch 15350, loss[loss=0.2223, simple_loss=0.3208, pruned_loss=0.06191, over 21634.00 frames. ], tot_loss[loss=0.2286, simple_loss=0.3033, pruned_loss=0.07695, over 4267566.63 frames. ], batch size: 263, lr: 3.73e-03, grad_scale: 16.0 2023-06-25 16:29:33,118 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1372992.0, ans=0.0 2023-06-25 16:29:58,421 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1373112.0, ans=0.125 2023-06-25 16:30:13,111 INFO [train.py:996] (3/4) Epoch 8, batch 15400, loss[loss=0.22, simple_loss=0.3042, pruned_loss=0.06791, over 21903.00 frames. ], tot_loss[loss=0.2284, simple_loss=0.305, pruned_loss=0.07587, over 4272764.04 frames. ], batch size: 107, lr: 3.73e-03, grad_scale: 16.0 2023-06-25 16:30:44,235 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=1373232.0, ans=0.2 2023-06-25 16:30:46,010 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1373232.0, ans=0.0 2023-06-25 16:30:46,014 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=1373232.0, ans=0.125 2023-06-25 16:30:46,935 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.892e+02 4.124e+02 5.602e+02 8.412e+02 1.592e+03, threshold=1.120e+03, percent-clipped=11.0 2023-06-25 16:32:01,306 INFO [train.py:996] (3/4) Epoch 8, batch 15450, loss[loss=0.2718, simple_loss=0.3572, pruned_loss=0.09315, over 21585.00 frames. ], tot_loss[loss=0.227, simple_loss=0.3031, pruned_loss=0.0754, over 4261262.14 frames. 
], batch size: 471, lr: 3.73e-03, grad_scale: 16.0 2023-06-25 16:32:46,492 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1373532.0, ans=0.125 2023-06-25 16:32:46,503 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1373532.0, ans=0.125 2023-06-25 16:32:57,128 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1373592.0, ans=0.125 2023-06-25 16:33:11,832 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=15.12 vs. limit=15.0 2023-06-25 16:33:27,094 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=1373652.0, ans=0.2 2023-06-25 16:34:02,539 INFO [train.py:996] (3/4) Epoch 8, batch 15500, loss[loss=0.2206, simple_loss=0.2994, pruned_loss=0.07087, over 21687.00 frames. ], tot_loss[loss=0.2286, simple_loss=0.3053, pruned_loss=0.07588, over 4264451.75 frames. ], batch size: 298, lr: 3.73e-03, grad_scale: 16.0 2023-06-25 16:34:27,238 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.729e+02 3.957e+02 5.678e+02 7.705e+02 1.506e+03, threshold=1.136e+03, percent-clipped=3.0 2023-06-25 16:35:18,847 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1373952.0, ans=0.125 2023-06-25 16:35:18,851 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1373952.0, ans=0.0 2023-06-25 16:35:53,646 INFO [train.py:996] (3/4) Epoch 8, batch 15550, loss[loss=0.2051, simple_loss=0.2858, pruned_loss=0.06222, over 21703.00 frames. ], tot_loss[loss=0.225, simple_loss=0.3028, pruned_loss=0.07356, over 4259853.98 frames. ], batch size: 332, lr: 3.73e-03, grad_scale: 16.0 2023-06-25 16:35:54,166 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-25 16:35:55,724 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=1374072.0, ans=0.0 2023-06-25 16:37:14,112 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.86 vs. limit=10.0 2023-06-25 16:37:42,353 INFO [train.py:996] (3/4) Epoch 8, batch 15600, loss[loss=0.222, simple_loss=0.2892, pruned_loss=0.07736, over 21804.00 frames. ], tot_loss[loss=0.2197, simple_loss=0.2961, pruned_loss=0.07171, over 4259338.73 frames. 
], batch size: 98, lr: 3.73e-03, grad_scale: 32.0 2023-06-25 16:37:58,363 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-25 16:38:01,279 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.645e+02 3.371e+02 3.943e+02 5.908e+02 1.274e+03, threshold=7.887e+02, percent-clipped=2.0 2023-06-25 16:38:32,410 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1374492.0, ans=0.125 2023-06-25 16:38:46,146 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=1374552.0, ans=0.125 2023-06-25 16:39:04,005 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.59 vs. limit=10.0 2023-06-25 16:39:30,800 INFO [train.py:996] (3/4) Epoch 8, batch 15650, loss[loss=0.1944, simple_loss=0.2614, pruned_loss=0.06372, over 21374.00 frames. ], tot_loss[loss=0.2177, simple_loss=0.2941, pruned_loss=0.07067, over 4253012.56 frames. ], batch size: 194, lr: 3.73e-03, grad_scale: 16.0 2023-06-25 16:39:52,221 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1374732.0, ans=0.1 2023-06-25 16:40:10,878 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 16:40:55,332 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.76 vs. limit=15.0 2023-06-25 16:41:19,310 INFO [train.py:996] (3/4) Epoch 8, batch 15700, loss[loss=0.2133, simple_loss=0.2889, pruned_loss=0.06884, over 21494.00 frames. ], tot_loss[loss=0.2153, simple_loss=0.2906, pruned_loss=0.07, over 4258687.87 frames. ], batch size: 389, lr: 3.73e-03, grad_scale: 16.0 2023-06-25 16:41:40,401 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.691e+02 3.513e+02 4.156e+02 5.605e+02 1.068e+03, threshold=8.312e+02, percent-clipped=8.0 2023-06-25 16:42:02,661 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=1375092.0, ans=0.125 2023-06-25 16:42:04,488 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-25 16:42:16,609 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1375152.0, ans=0.125 2023-06-25 16:43:00,789 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1375212.0, ans=0.125 2023-06-25 16:43:06,881 INFO [train.py:996] (3/4) Epoch 8, batch 15750, loss[loss=0.2261, simple_loss=0.2814, pruned_loss=0.08536, over 21843.00 frames. ], tot_loss[loss=0.2134, simple_loss=0.2869, pruned_loss=0.06996, over 4258157.54 frames. ], batch size: 98, lr: 3.73e-03, grad_scale: 16.0 2023-06-25 16:44:07,848 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1375452.0, ans=0.125 2023-06-25 16:44:55,702 INFO [train.py:996] (3/4) Epoch 8, batch 15800, loss[loss=0.1887, simple_loss=0.2564, pruned_loss=0.0605, over 21521.00 frames. ], tot_loss[loss=0.2108, simple_loss=0.2826, pruned_loss=0.0695, over 4246099.23 frames. 
], batch size: 195, lr: 3.72e-03, grad_scale: 16.0 2023-06-25 16:45:15,826 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=1375632.0, ans=0.0 2023-06-25 16:45:16,666 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.881e+02 4.132e+02 5.788e+02 8.606e+02 2.042e+03, threshold=1.158e+03, percent-clipped=26.0 2023-06-25 16:45:54,906 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1375752.0, ans=0.125 2023-06-25 16:46:32,543 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1375812.0, ans=0.1 2023-06-25 16:46:39,236 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=1375812.0, ans=0.0 2023-06-25 16:46:44,086 INFO [train.py:996] (3/4) Epoch 8, batch 15850, loss[loss=0.1895, simple_loss=0.2564, pruned_loss=0.06124, over 21738.00 frames. ], tot_loss[loss=0.2137, simple_loss=0.2853, pruned_loss=0.07101, over 4257510.86 frames. ], batch size: 112, lr: 3.72e-03, grad_scale: 16.0 2023-06-25 16:47:20,301 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=13.43 vs. limit=22.5 2023-06-25 16:47:21,883 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.85 vs. limit=15.0 2023-06-25 16:47:38,678 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1375992.0, ans=0.125 2023-06-25 16:48:19,399 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.44 vs. limit=10.0 2023-06-25 16:48:25,895 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=1376112.0, ans=0.0 2023-06-25 16:48:32,033 INFO [train.py:996] (3/4) Epoch 8, batch 15900, loss[loss=0.2339, simple_loss=0.3109, pruned_loss=0.07839, over 21841.00 frames. ], tot_loss[loss=0.2128, simple_loss=0.2844, pruned_loss=0.07066, over 4259367.25 frames. ], batch size: 118, lr: 3.72e-03, grad_scale: 16.0 2023-06-25 16:48:52,535 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.061e+02 4.407e+02 5.744e+02 8.356e+02 1.559e+03, threshold=1.149e+03, percent-clipped=5.0 2023-06-25 16:50:19,060 INFO [train.py:996] (3/4) Epoch 8, batch 15950, loss[loss=0.2276, simple_loss=0.323, pruned_loss=0.0661, over 21775.00 frames. ], tot_loss[loss=0.2116, simple_loss=0.2852, pruned_loss=0.06902, over 4253255.81 frames. ], batch size: 351, lr: 3.72e-03, grad_scale: 16.0 2023-06-25 16:50:28,233 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1376472.0, ans=0.1 2023-06-25 16:50:35,212 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=1376532.0, ans=0.125 2023-06-25 16:51:02,092 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1376592.0, ans=0.1 2023-06-25 16:51:04,249 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.95 vs. 
limit=6.0 2023-06-25 16:52:10,846 INFO [train.py:996] (3/4) Epoch 8, batch 16000, loss[loss=0.2225, simple_loss=0.3117, pruned_loss=0.06666, over 21821.00 frames. ], tot_loss[loss=0.2089, simple_loss=0.2851, pruned_loss=0.06641, over 4256489.94 frames. ], batch size: 298, lr: 3.72e-03, grad_scale: 32.0 2023-06-25 16:52:16,802 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=1376772.0, ans=0.125 2023-06-25 16:52:23,613 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=1376772.0, ans=0.125 2023-06-25 16:52:30,904 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=1376832.0, ans=0.05 2023-06-25 16:52:31,813 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.505e+02 3.799e+02 4.877e+02 8.252e+02 1.708e+03, threshold=9.755e+02, percent-clipped=5.0 2023-06-25 16:52:43,386 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=1376832.0, ans=0.2 2023-06-25 16:53:01,518 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.63 vs. limit=22.5 2023-06-25 16:53:39,068 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1377012.0, ans=0.125 2023-06-25 16:53:55,017 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.68 vs. limit=12.0 2023-06-25 16:53:59,489 INFO [train.py:996] (3/4) Epoch 8, batch 16050, loss[loss=0.2079, simple_loss=0.2963, pruned_loss=0.05976, over 21663.00 frames. ], tot_loss[loss=0.2091, simple_loss=0.288, pruned_loss=0.06514, over 4260320.01 frames. ], batch size: 230, lr: 3.72e-03, grad_scale: 16.0 2023-06-25 16:54:30,998 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1377132.0, ans=0.1 2023-06-25 16:54:33,422 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.22 vs. limit=15.0 2023-06-25 16:54:53,790 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1377192.0, ans=0.125 2023-06-25 16:54:57,358 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1377252.0, ans=0.125 2023-06-25 16:55:47,387 INFO [train.py:996] (3/4) Epoch 8, batch 16100, loss[loss=0.2285, simple_loss=0.2961, pruned_loss=0.0804, over 21335.00 frames. ], tot_loss[loss=0.2133, simple_loss=0.2918, pruned_loss=0.06742, over 4265862.83 frames. 
], batch size: 159, lr: 3.72e-03, grad_scale: 16.0 2023-06-25 16:56:10,218 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.825e+02 4.315e+02 5.631e+02 9.006e+02 2.276e+03, threshold=1.126e+03, percent-clipped=22.0 2023-06-25 16:56:10,979 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1377432.0, ans=0.1 2023-06-25 16:56:29,723 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1377492.0, ans=0.125 2023-06-25 16:56:32,143 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.33 vs. limit=22.5 2023-06-25 16:56:44,390 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=1377552.0, ans=0.125 2023-06-25 16:57:35,030 INFO [train.py:996] (3/4) Epoch 8, batch 16150, loss[loss=0.217, simple_loss=0.2801, pruned_loss=0.07695, over 21333.00 frames. ], tot_loss[loss=0.2153, simple_loss=0.2919, pruned_loss=0.0693, over 4278343.11 frames. ], batch size: 176, lr: 3.72e-03, grad_scale: 16.0 2023-06-25 16:58:29,967 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1377792.0, ans=0.125 2023-06-25 16:58:42,288 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=1377852.0, ans=0.125 2023-06-25 16:59:14,441 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.60 vs. limit=15.0 2023-06-25 16:59:23,922 INFO [train.py:996] (3/4) Epoch 8, batch 16200, loss[loss=0.2535, simple_loss=0.336, pruned_loss=0.08551, over 21702.00 frames. ], tot_loss[loss=0.2196, simple_loss=0.2968, pruned_loss=0.07119, over 4278532.03 frames. ], batch size: 351, lr: 3.72e-03, grad_scale: 16.0 2023-06-25 16:59:46,143 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.109e+02 4.018e+02 5.082e+02 7.447e+02 1.479e+03, threshold=1.016e+03, percent-clipped=6.0 2023-06-25 16:59:58,877 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=1378092.0, ans=0.0 2023-06-25 17:00:20,671 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=2.89 vs. limit=15.0 2023-06-25 17:01:08,888 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1378212.0, ans=0.125 2023-06-25 17:01:11,826 INFO [train.py:996] (3/4) Epoch 8, batch 16250, loss[loss=0.1904, simple_loss=0.2588, pruned_loss=0.06102, over 21848.00 frames. ], tot_loss[loss=0.2183, simple_loss=0.2957, pruned_loss=0.07043, over 4277889.18 frames. ], batch size: 118, lr: 3.72e-03, grad_scale: 16.0 2023-06-25 17:01:59,206 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=10.12 vs. 
limit=10.0 2023-06-25 17:02:05,345 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=1378392.0, ans=10.0 2023-06-25 17:02:05,369 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1378392.0, ans=0.125 2023-06-25 17:02:26,440 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=1378452.0, ans=0.2 2023-06-25 17:02:35,025 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1378452.0, ans=0.0 2023-06-25 17:02:47,162 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=1378512.0, ans=0.2 2023-06-25 17:03:00,427 INFO [train.py:996] (3/4) Epoch 8, batch 16300, loss[loss=0.1781, simple_loss=0.244, pruned_loss=0.05609, over 21866.00 frames. ], tot_loss[loss=0.2126, simple_loss=0.2915, pruned_loss=0.06686, over 4271399.60 frames. ], batch size: 107, lr: 3.72e-03, grad_scale: 16.0 2023-06-25 17:03:24,185 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.317e+02 3.309e+02 4.494e+02 6.869e+02 1.781e+03, threshold=8.988e+02, percent-clipped=11.0 2023-06-25 17:03:44,829 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1378692.0, ans=0.125 2023-06-25 17:04:50,598 INFO [train.py:996] (3/4) Epoch 8, batch 16350, loss[loss=0.2025, simple_loss=0.282, pruned_loss=0.06155, over 21781.00 frames. ], tot_loss[loss=0.2132, simple_loss=0.2916, pruned_loss=0.06741, over 4270482.53 frames. ], batch size: 102, lr: 3.72e-03, grad_scale: 16.0 2023-06-25 17:04:52,917 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-25 17:04:56,889 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=1378872.0, ans=0.07 2023-06-25 17:05:40,670 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=1378992.0, ans=0.125 2023-06-25 17:06:15,393 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=1379052.0, ans=0.07 2023-06-25 17:06:39,414 INFO [train.py:996] (3/4) Epoch 8, batch 16400, loss[loss=0.2017, simple_loss=0.2771, pruned_loss=0.06316, over 21929.00 frames. ], tot_loss[loss=0.2193, simple_loss=0.2977, pruned_loss=0.07046, over 4272740.56 frames. ], batch size: 107, lr: 3.72e-03, grad_scale: 32.0 2023-06-25 17:06:41,640 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1379172.0, ans=0.1 2023-06-25 17:06:43,321 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1379172.0, ans=0.125 2023-06-25 17:06:43,351 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1379172.0, ans=0.0 2023-06-25 17:06:47,429 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.33 vs. 
limit=15.0 2023-06-25 17:07:09,116 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.039e+02 4.348e+02 5.266e+02 7.750e+02 2.110e+03, threshold=1.053e+03, percent-clipped=17.0 2023-06-25 17:07:28,746 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=1379292.0, ans=0.04949747468305833 2023-06-25 17:07:29,302 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.46 vs. limit=15.0 2023-06-25 17:07:44,464 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=1379292.0, ans=0.125 2023-06-25 17:07:56,823 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1379352.0, ans=0.125 2023-06-25 17:08:01,969 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=1379352.0, ans=0.2 2023-06-25 17:08:05,630 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=1379412.0, ans=0.125 2023-06-25 17:08:22,731 INFO [train.py:996] (3/4) Epoch 8, batch 16450, loss[loss=0.25, simple_loss=0.3142, pruned_loss=0.09286, over 21783.00 frames. ], tot_loss[loss=0.2207, simple_loss=0.2978, pruned_loss=0.07185, over 4276132.72 frames. ], batch size: 441, lr: 3.72e-03, grad_scale: 16.0 2023-06-25 17:10:08,274 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1379712.0, ans=0.0 2023-06-25 17:10:10,063 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1379712.0, ans=0.125 2023-06-25 17:10:12,897 INFO [train.py:996] (3/4) Epoch 8, batch 16500, loss[loss=0.187, simple_loss=0.2569, pruned_loss=0.05851, over 21667.00 frames. ], tot_loss[loss=0.2203, simple_loss=0.2965, pruned_loss=0.07206, over 4276477.66 frames. ], batch size: 263, lr: 3.72e-03, grad_scale: 16.0 2023-06-25 17:10:13,563 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=1379772.0, ans=0.0 2023-06-25 17:10:24,863 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.57 vs. limit=15.0 2023-06-25 17:10:43,402 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.095e+02 4.406e+02 5.923e+02 9.341e+02 2.012e+03, threshold=1.185e+03, percent-clipped=18.0 2023-06-25 17:11:22,638 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=1379892.0, ans=0.125 2023-06-25 17:11:57,382 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.29 vs. limit=22.5 2023-06-25 17:12:02,737 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.07 vs. limit=15.0 2023-06-25 17:12:03,232 INFO [train.py:996] (3/4) Epoch 8, batch 16550, loss[loss=0.2156, simple_loss=0.2893, pruned_loss=0.071, over 20018.00 frames. ], tot_loss[loss=0.2167, simple_loss=0.2932, pruned_loss=0.07008, over 4269028.33 frames. 
], batch size: 702, lr: 3.72e-03, grad_scale: 16.0 2023-06-25 17:12:26,655 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=12.27 vs. limit=22.5 2023-06-25 17:13:26,720 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1380252.0, ans=0.125 2023-06-25 17:13:33,565 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1380252.0, ans=0.0 2023-06-25 17:14:05,610 INFO [train.py:996] (3/4) Epoch 8, batch 16600, loss[loss=0.2449, simple_loss=0.3483, pruned_loss=0.07074, over 21300.00 frames. ], tot_loss[loss=0.2226, simple_loss=0.3004, pruned_loss=0.07242, over 4267681.00 frames. ], batch size: 176, lr: 3.72e-03, grad_scale: 16.0 2023-06-25 17:14:08,238 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1380372.0, ans=0.0 2023-06-25 17:14:35,305 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.15 vs. limit=15.0 2023-06-25 17:14:41,012 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.056e+02 4.921e+02 6.632e+02 9.394e+02 2.372e+03, threshold=1.326e+03, percent-clipped=11.0 2023-06-25 17:14:51,517 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=3.65 vs. limit=12.0 2023-06-25 17:15:02,013 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=6.77 vs. limit=22.5 2023-06-25 17:15:03,177 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1380492.0, ans=0.0 2023-06-25 17:15:06,397 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1380492.0, ans=0.125 2023-06-25 17:15:09,941 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=1380552.0, ans=0.2 2023-06-25 17:15:32,645 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=3.76 vs. limit=12.0 2023-06-25 17:16:01,839 INFO [train.py:996] (3/4) Epoch 8, batch 16650, loss[loss=0.2755, simple_loss=0.3465, pruned_loss=0.1022, over 21761.00 frames. ], tot_loss[loss=0.2292, simple_loss=0.3091, pruned_loss=0.07471, over 4271699.85 frames. ], batch size: 441, lr: 3.72e-03, grad_scale: 16.0 2023-06-25 17:16:16,429 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1380672.0, ans=0.125 2023-06-25 17:16:27,492 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1380732.0, ans=0.1 2023-06-25 17:16:29,358 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1380732.0, ans=0.0 2023-06-25 17:16:52,022 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=1380792.0, ans=0.0 2023-06-25 17:18:00,230 INFO [train.py:996] (3/4) Epoch 8, batch 16700, loss[loss=0.2289, simple_loss=0.3115, pruned_loss=0.07316, over 21727.00 frames. 
], tot_loss[loss=0.2303, simple_loss=0.3095, pruned_loss=0.07554, over 4267970.70 frames. ], batch size: 351, lr: 3.72e-03, grad_scale: 16.0 2023-06-25 17:18:05,473 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.72 vs. limit=12.0 2023-06-25 17:18:21,252 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=1381032.0, ans=0.035 2023-06-25 17:18:23,126 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=1381032.0, ans=0.5 2023-06-25 17:18:26,039 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.158e+02 5.069e+02 7.220e+02 1.088e+03 2.234e+03, threshold=1.444e+03, percent-clipped=12.0 2023-06-25 17:18:54,452 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.08 vs. limit=15.0 2023-06-25 17:19:00,574 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1381092.0, ans=0.1 2023-06-25 17:19:02,735 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1381092.0, ans=0.1 2023-06-25 17:19:06,146 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=1381092.0, ans=0.09899494936611666 2023-06-25 17:19:32,254 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1381152.0, ans=0.0 2023-06-25 17:19:34,169 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1381212.0, ans=0.125 2023-06-25 17:19:54,796 INFO [train.py:996] (3/4) Epoch 8, batch 16750, loss[loss=0.2442, simple_loss=0.3415, pruned_loss=0.07338, over 21711.00 frames. ], tot_loss[loss=0.2335, simple_loss=0.3125, pruned_loss=0.07729, over 4267805.73 frames. ], batch size: 351, lr: 3.72e-03, grad_scale: 16.0 2023-06-25 17:20:05,732 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 17:21:02,965 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=1381392.0, ans=0.125 2023-06-25 17:21:12,963 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=1381452.0, ans=0.0 2023-06-25 17:21:47,561 INFO [train.py:996] (3/4) Epoch 8, batch 16800, loss[loss=0.2543, simple_loss=0.3529, pruned_loss=0.07782, over 20701.00 frames. ], tot_loss[loss=0.2352, simple_loss=0.3158, pruned_loss=0.07735, over 4261077.55 frames. ], batch size: 607, lr: 3.72e-03, grad_scale: 16.0 2023-06-25 17:21:50,121 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 17:21:52,286 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.57 vs. 
limit=22.5 2023-06-25 17:21:53,529 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=1381572.0, ans=0.0 2023-06-25 17:22:18,702 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.357e+02 4.342e+02 5.532e+02 7.799e+02 1.934e+03, threshold=1.106e+03, percent-clipped=5.0 2023-06-25 17:23:23,404 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1381872.0, ans=0.1 2023-06-25 17:23:24,338 INFO [train.py:996] (3/4) Epoch 8, batch 16850, loss[loss=0.2361, simple_loss=0.2976, pruned_loss=0.08728, over 21944.00 frames. ], tot_loss[loss=0.233, simple_loss=0.3117, pruned_loss=0.07716, over 4269545.64 frames. ], batch size: 351, lr: 3.72e-03, grad_scale: 16.0 2023-06-25 17:23:26,769 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1381872.0, ans=0.1 2023-06-25 17:23:41,332 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=9.09 vs. limit=10.0 2023-06-25 17:24:19,084 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=1381992.0, ans=0.0 2023-06-25 17:24:27,709 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1381992.0, ans=0.0 2023-06-25 17:24:34,421 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=1381992.0, ans=0.0 2023-06-25 17:25:11,471 INFO [train.py:996] (3/4) Epoch 8, batch 16900, loss[loss=0.1986, simple_loss=0.2737, pruned_loss=0.0617, over 21392.00 frames. ], tot_loss[loss=0.2303, simple_loss=0.3082, pruned_loss=0.07626, over 4269604.66 frames. ], batch size: 131, lr: 3.72e-03, grad_scale: 16.0 2023-06-25 17:25:26,183 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=1382172.0, ans=0.07 2023-06-25 17:25:58,354 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.598e+02 4.085e+02 5.568e+02 7.476e+02 1.428e+03, threshold=1.114e+03, percent-clipped=3.0 2023-06-25 17:26:09,424 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1382292.0, ans=0.125 2023-06-25 17:26:23,411 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1382352.0, ans=0.0 2023-06-25 17:26:33,399 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=1382352.0, ans=10.0 2023-06-25 17:26:59,154 INFO [train.py:996] (3/4) Epoch 8, batch 16950, loss[loss=0.2568, simple_loss=0.3024, pruned_loss=0.1056, over 21765.00 frames. ], tot_loss[loss=0.2256, simple_loss=0.3018, pruned_loss=0.0747, over 4273607.50 frames. 
], batch size: 508, lr: 3.72e-03, grad_scale: 16.0 2023-06-25 17:27:06,780 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1382472.0, ans=0.125 2023-06-25 17:27:34,746 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-25 17:27:37,837 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1382532.0, ans=0.125 2023-06-25 17:28:03,549 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=1382592.0, ans=0.125 2023-06-25 17:28:28,766 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer_na.min_abs, batch_count=1382712.0, ans=0.02 2023-06-25 17:28:52,374 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1382772.0, ans=0.125 2023-06-25 17:28:53,780 INFO [train.py:996] (3/4) Epoch 8, batch 17000, loss[loss=0.2344, simple_loss=0.3, pruned_loss=0.08438, over 21799.00 frames. ], tot_loss[loss=0.2249, simple_loss=0.2991, pruned_loss=0.0753, over 4281837.02 frames. ], batch size: 441, lr: 3.71e-03, grad_scale: 16.0 2023-06-25 17:29:35,990 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.768e+02 4.400e+02 6.237e+02 1.054e+03 1.925e+03, threshold=1.247e+03, percent-clipped=22.0 2023-06-25 17:30:15,541 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.61 vs. limit=12.0 2023-06-25 17:30:47,374 INFO [train.py:996] (3/4) Epoch 8, batch 17050, loss[loss=0.2347, simple_loss=0.3058, pruned_loss=0.08183, over 21899.00 frames. ], tot_loss[loss=0.2299, simple_loss=0.3064, pruned_loss=0.07666, over 4280904.89 frames. ], batch size: 107, lr: 3.71e-03, grad_scale: 16.0 2023-06-25 17:30:50,298 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.64 vs. limit=22.5 2023-06-25 17:31:40,094 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1383192.0, ans=0.0 2023-06-25 17:32:29,756 INFO [train.py:996] (3/4) Epoch 8, batch 17100, loss[loss=0.2169, simple_loss=0.291, pruned_loss=0.07138, over 21863.00 frames. ], tot_loss[loss=0.2298, simple_loss=0.3058, pruned_loss=0.07692, over 4289578.43 frames. ], batch size: 124, lr: 3.71e-03, grad_scale: 16.0 2023-06-25 17:32:58,544 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=1383432.0, ans=0.125 2023-06-25 17:33:06,780 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.094e+02 4.538e+02 6.730e+02 8.383e+02 1.322e+03, threshold=1.346e+03, percent-clipped=2.0 2023-06-25 17:33:11,388 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.17 vs. 
limit=15.0 2023-06-25 17:33:16,224 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=1383492.0, ans=0.07 2023-06-25 17:33:58,510 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1383612.0, ans=0.1 2023-06-25 17:34:12,037 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=1383612.0, ans=0.2 2023-06-25 17:34:23,412 INFO [train.py:996] (3/4) Epoch 8, batch 17150, loss[loss=0.1724, simple_loss=0.2462, pruned_loss=0.04928, over 21255.00 frames. ], tot_loss[loss=0.2268, simple_loss=0.3013, pruned_loss=0.07615, over 4290242.89 frames. ], batch size: 176, lr: 3.71e-03, grad_scale: 16.0 2023-06-25 17:34:43,562 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=1383672.0, ans=0.0 2023-06-25 17:35:07,941 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=1383792.0, ans=0.025 2023-06-25 17:35:11,421 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1383792.0, ans=0.0 2023-06-25 17:35:18,761 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1383792.0, ans=0.1 2023-06-25 17:36:15,477 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1383912.0, ans=0.0 2023-06-25 17:36:18,567 INFO [train.py:996] (3/4) Epoch 8, batch 17200, loss[loss=0.2705, simple_loss=0.35, pruned_loss=0.09549, over 21797.00 frames. ], tot_loss[loss=0.227, simple_loss=0.3016, pruned_loss=0.07617, over 4287918.32 frames. ], batch size: 124, lr: 3.71e-03, grad_scale: 32.0 2023-06-25 17:36:29,478 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1383972.0, ans=0.125 2023-06-25 17:36:31,880 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=9.56 vs. limit=15.0 2023-06-25 17:36:44,519 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.991e+02 4.225e+02 5.384e+02 7.580e+02 1.533e+03, threshold=1.077e+03, percent-clipped=1.0 2023-06-25 17:36:56,588 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=4.53 vs. limit=15.0 2023-06-25 17:37:20,587 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1384152.0, ans=0.0 2023-06-25 17:37:56,080 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1384212.0, ans=0.125 2023-06-25 17:38:07,336 INFO [train.py:996] (3/4) Epoch 8, batch 17250, loss[loss=0.2648, simple_loss=0.345, pruned_loss=0.09231, over 21250.00 frames. ], tot_loss[loss=0.2303, simple_loss=0.3045, pruned_loss=0.07809, over 4289077.02 frames. 
], batch size: 143, lr: 3.71e-03, grad_scale: 16.0 2023-06-25 17:38:26,160 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1384332.0, ans=0.125 2023-06-25 17:38:54,409 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1384392.0, ans=0.1 2023-06-25 17:39:54,318 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer_ff2.min_abs, batch_count=1384512.0, ans=0.1 2023-06-25 17:39:57,086 INFO [train.py:996] (3/4) Epoch 8, batch 17300, loss[loss=0.2553, simple_loss=0.3298, pruned_loss=0.09039, over 21318.00 frames. ], tot_loss[loss=0.2362, simple_loss=0.311, pruned_loss=0.08073, over 4284049.80 frames. ], batch size: 131, lr: 3.71e-03, grad_scale: 16.0 2023-06-25 17:40:25,190 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.394e+02 4.465e+02 6.350e+02 1.043e+03 2.141e+03, threshold=1.270e+03, percent-clipped=19.0 2023-06-25 17:40:38,947 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1384692.0, ans=0.125 2023-06-25 17:41:23,834 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=10.21 vs. limit=15.0 2023-06-25 17:41:45,084 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=1384812.0, ans=0.125 2023-06-25 17:41:47,988 INFO [train.py:996] (3/4) Epoch 8, batch 17350, loss[loss=0.2413, simple_loss=0.3315, pruned_loss=0.07554, over 21646.00 frames. ], tot_loss[loss=0.2365, simple_loss=0.3119, pruned_loss=0.08057, over 4281128.74 frames. ], batch size: 441, lr: 3.71e-03, grad_scale: 16.0 2023-06-25 17:41:48,629 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1384872.0, ans=0.125 2023-06-25 17:41:52,428 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1384872.0, ans=0.0 2023-06-25 17:41:59,816 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=5.34 vs. limit=10.0 2023-06-25 17:43:05,037 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1385052.0, ans=0.125 2023-06-25 17:43:13,772 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=1385052.0, ans=0.95 2023-06-25 17:43:15,610 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1385052.0, ans=0.125 2023-06-25 17:43:38,076 INFO [train.py:996] (3/4) Epoch 8, batch 17400, loss[loss=0.1901, simple_loss=0.261, pruned_loss=0.05959, over 21429.00 frames. ], tot_loss[loss=0.2311, simple_loss=0.3082, pruned_loss=0.07695, over 4274095.15 frames. 
], batch size: 211, lr: 3.71e-03, grad_scale: 16.0 2023-06-25 17:43:58,261 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1385172.0, ans=0.1 2023-06-25 17:44:17,084 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=1385232.0, ans=0.2 2023-06-25 17:44:17,178 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1385232.0, ans=0.125 2023-06-25 17:44:22,054 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.624e+02 3.735e+02 4.973e+02 6.706e+02 2.674e+03, threshold=9.946e+02, percent-clipped=3.0 2023-06-25 17:44:38,639 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1385292.0, ans=0.0 2023-06-25 17:45:32,727 INFO [train.py:996] (3/4) Epoch 8, batch 17450, loss[loss=0.1956, simple_loss=0.2945, pruned_loss=0.04837, over 21670.00 frames. ], tot_loss[loss=0.2249, simple_loss=0.3027, pruned_loss=0.07354, over 4271375.31 frames. ], batch size: 414, lr: 3.71e-03, grad_scale: 16.0 2023-06-25 17:46:44,789 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.40 vs. limit=15.0 2023-06-25 17:47:08,491 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=1385712.0, ans=0.0 2023-06-25 17:47:20,323 INFO [train.py:996] (3/4) Epoch 8, batch 17500, loss[loss=0.2023, simple_loss=0.2734, pruned_loss=0.0656, over 21752.00 frames. ], tot_loss[loss=0.2211, simple_loss=0.3001, pruned_loss=0.07111, over 4276538.32 frames. ], batch size: 230, lr: 3.71e-03, grad_scale: 16.0 2023-06-25 17:47:57,720 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.917e+02 3.737e+02 5.034e+02 7.979e+02 1.418e+03, threshold=1.007e+03, percent-clipped=12.0 2023-06-25 17:48:09,150 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1385892.0, ans=0.125 2023-06-25 17:49:07,124 INFO [train.py:996] (3/4) Epoch 8, batch 17550, loss[loss=0.2029, simple_loss=0.2466, pruned_loss=0.07957, over 20361.00 frames. ], tot_loss[loss=0.2204, simple_loss=0.2999, pruned_loss=0.07046, over 4271617.53 frames. ], batch size: 703, lr: 3.71e-03, grad_scale: 16.0 2023-06-25 17:49:13,026 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1386072.0, ans=0.0 2023-06-25 17:49:14,878 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=1386072.0, ans=0.125 2023-06-25 17:49:14,903 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1386072.0, ans=0.125 2023-06-25 17:49:27,315 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.23 vs. 
limit=15.0 2023-06-25 17:50:04,761 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=1386192.0, ans=0.0 2023-06-25 17:50:11,743 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1386252.0, ans=0.125 2023-06-25 17:50:54,175 INFO [train.py:996] (3/4) Epoch 8, batch 17600, loss[loss=0.2404, simple_loss=0.3287, pruned_loss=0.076, over 21823.00 frames. ], tot_loss[loss=0.2218, simple_loss=0.3019, pruned_loss=0.0708, over 4271500.18 frames. ], batch size: 124, lr: 3.71e-03, grad_scale: 32.0 2023-06-25 17:51:00,767 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.09 vs. limit=15.0 2023-06-25 17:51:01,854 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=1386372.0, ans=0.2 2023-06-25 17:51:24,060 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=1386432.0, ans=0.0 2023-06-25 17:51:33,991 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.904e+02 3.918e+02 5.459e+02 7.837e+02 1.902e+03, threshold=1.092e+03, percent-clipped=12.0 2023-06-25 17:51:56,673 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=7.15 vs. limit=15.0 2023-06-25 17:52:13,990 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=1386552.0, ans=0.2 2023-06-25 17:52:29,764 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=1386612.0, ans=0.0 2023-06-25 17:52:32,120 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=7.96 vs. limit=15.0 2023-06-25 17:52:43,251 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=1386672.0, ans=0.125 2023-06-25 17:52:43,340 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1386672.0, ans=0.1 2023-06-25 17:52:49,584 INFO [train.py:996] (3/4) Epoch 8, batch 17650, loss[loss=0.193, simple_loss=0.2603, pruned_loss=0.06288, over 21756.00 frames. ], tot_loss[loss=0.2206, simple_loss=0.2998, pruned_loss=0.07068, over 4264803.66 frames. ], batch size: 282, lr: 3.71e-03, grad_scale: 32.0 2023-06-25 17:53:50,385 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=1386852.0, ans=0.2 2023-06-25 17:54:22,269 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1386912.0, ans=0.125 2023-06-25 17:54:39,346 INFO [train.py:996] (3/4) Epoch 8, batch 17700, loss[loss=0.2766, simple_loss=0.3558, pruned_loss=0.09869, over 21432.00 frames. ], tot_loss[loss=0.2146, simple_loss=0.2936, pruned_loss=0.06782, over 4267173.45 frames. ], batch size: 471, lr: 3.71e-03, grad_scale: 16.0 2023-06-25 17:55:03,759 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.23 vs. 
limit=15.0 2023-06-25 17:55:14,586 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.821e+02 4.539e+02 6.208e+02 9.459e+02 1.772e+03, threshold=1.242e+03, percent-clipped=17.0 2023-06-25 17:56:01,887 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=10.21 vs. limit=15.0 2023-06-25 17:56:10,583 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 17:56:12,227 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.max_abs, batch_count=1387212.0, ans=10.0 2023-06-25 17:56:34,262 INFO [train.py:996] (3/4) Epoch 8, batch 17750, loss[loss=0.2748, simple_loss=0.3428, pruned_loss=0.1034, over 21791.00 frames. ], tot_loss[loss=0.2214, simple_loss=0.3013, pruned_loss=0.07073, over 4269791.91 frames. ], batch size: 441, lr: 3.71e-03, grad_scale: 16.0 2023-06-25 17:56:39,033 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=1387272.0, ans=0.95 2023-06-25 17:57:08,005 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.53 vs. limit=15.0 2023-06-25 17:57:10,762 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=1387392.0, ans=0.2 2023-06-25 17:57:52,565 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-25 17:57:56,164 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1387452.0, ans=0.1 2023-06-25 17:58:06,740 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=1387512.0, ans=0.2 2023-06-25 17:58:25,859 INFO [train.py:996] (3/4) Epoch 8, batch 17800, loss[loss=0.1811, simple_loss=0.2578, pruned_loss=0.05219, over 21566.00 frames. ], tot_loss[loss=0.2207, simple_loss=0.301, pruned_loss=0.07019, over 4269881.44 frames. ], batch size: 112, lr: 3.71e-03, grad_scale: 16.0 2023-06-25 17:58:55,811 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.433e+02 4.160e+02 4.945e+02 7.686e+02 1.227e+03, threshold=9.890e+02, percent-clipped=0.0 2023-06-25 18:00:00,268 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1387812.0, ans=0.0 2023-06-25 18:00:10,385 INFO [train.py:996] (3/4) Epoch 8, batch 17850, loss[loss=0.2282, simple_loss=0.3025, pruned_loss=0.07702, over 21780.00 frames. ], tot_loss[loss=0.2234, simple_loss=0.303, pruned_loss=0.07193, over 4273058.03 frames. ], batch size: 247, lr: 3.71e-03, grad_scale: 16.0 2023-06-25 18:00:16,901 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=14.92 vs. limit=15.0 2023-06-25 18:01:41,538 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1388112.0, ans=0.125 2023-06-25 18:01:54,651 INFO [train.py:996] (3/4) Epoch 8, batch 17900, loss[loss=0.2526, simple_loss=0.3508, pruned_loss=0.07718, over 21647.00 frames. ], tot_loss[loss=0.2277, simple_loss=0.3079, pruned_loss=0.07371, over 4278155.07 frames. 
], batch size: 414, lr: 3.71e-03, grad_scale: 16.0 2023-06-25 18:02:40,618 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.096e+02 4.760e+02 6.226e+02 9.356e+02 2.163e+03, threshold=1.245e+03, percent-clipped=21.0 2023-06-25 18:03:35,274 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1388412.0, ans=0.0 2023-06-25 18:03:43,442 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=1388472.0, ans=0.125 2023-06-25 18:03:44,512 INFO [train.py:996] (3/4) Epoch 8, batch 17950, loss[loss=0.1779, simple_loss=0.2614, pruned_loss=0.04726, over 21820.00 frames. ], tot_loss[loss=0.2245, simple_loss=0.3072, pruned_loss=0.07088, over 4270832.21 frames. ], batch size: 118, lr: 3.71e-03, grad_scale: 16.0 2023-06-25 18:03:58,174 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=1388472.0, ans=0.07 2023-06-25 18:04:08,859 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.83 vs. limit=15.0 2023-06-25 18:04:36,134 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=1388592.0, ans=0.0 2023-06-25 18:05:27,072 INFO [train.py:996] (3/4) Epoch 8, batch 18000, loss[loss=0.2163, simple_loss=0.285, pruned_loss=0.07377, over 21500.00 frames. ], tot_loss[loss=0.2198, simple_loss=0.3011, pruned_loss=0.06927, over 4273914.29 frames. ], batch size: 441, lr: 3.71e-03, grad_scale: 32.0 2023-06-25 18:05:27,073 INFO [train.py:1019] (3/4) Computing validation loss 2023-06-25 18:05:48,119 INFO [train.py:1028] (3/4) Epoch 8, validation: loss=0.2638, simple_loss=0.3571, pruned_loss=0.08527, over 1796401.00 frames. 2023-06-25 18:05:48,120 INFO [train.py:1029] (3/4) Maximum memory allocated so far is 23654MB 2023-06-25 18:06:13,832 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=14.78 vs. limit=22.5 2023-06-25 18:06:23,287 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.397e+02 3.503e+02 4.294e+02 6.004e+02 1.457e+03, threshold=8.588e+02, percent-clipped=3.0 2023-06-25 18:06:53,630 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1388952.0, ans=0.0 2023-06-25 18:07:37,528 INFO [train.py:996] (3/4) Epoch 8, batch 18050, loss[loss=0.2452, simple_loss=0.3106, pruned_loss=0.08995, over 21716.00 frames. ], tot_loss[loss=0.2159, simple_loss=0.2954, pruned_loss=0.06824, over 4274762.90 frames. ], batch size: 231, lr: 3.71e-03, grad_scale: 32.0 2023-06-25 18:07:38,372 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1389072.0, ans=0.1 2023-06-25 18:09:33,659 INFO [train.py:996] (3/4) Epoch 8, batch 18100, loss[loss=0.223, simple_loss=0.2887, pruned_loss=0.07865, over 20053.00 frames. ], tot_loss[loss=0.2202, simple_loss=0.2989, pruned_loss=0.07078, over 4270732.86 frames. 
], batch size: 702, lr: 3.71e-03, grad_scale: 16.0 2023-06-25 18:09:45,320 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1389372.0, ans=0.1 2023-06-25 18:10:05,555 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.803e+02 3.773e+02 4.901e+02 6.840e+02 2.108e+03, threshold=9.801e+02, percent-clipped=15.0 2023-06-25 18:10:22,092 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.05 vs. limit=6.0 2023-06-25 18:11:22,663 INFO [train.py:996] (3/4) Epoch 8, batch 18150, loss[loss=0.2518, simple_loss=0.3491, pruned_loss=0.07722, over 19881.00 frames. ], tot_loss[loss=0.2218, simple_loss=0.3014, pruned_loss=0.07108, over 4275147.18 frames. ], batch size: 702, lr: 3.71e-03, grad_scale: 16.0 2023-06-25 18:11:40,464 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1389732.0, ans=0.125 2023-06-25 18:11:42,631 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=8.08 vs. limit=15.0 2023-06-25 18:11:59,590 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1389792.0, ans=0.125 2023-06-25 18:12:07,998 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1389792.0, ans=0.0 2023-06-25 18:13:09,438 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.26 vs. limit=22.5 2023-06-25 18:13:10,100 INFO [train.py:996] (3/4) Epoch 8, batch 18200, loss[loss=0.2193, simple_loss=0.2832, pruned_loss=0.07772, over 21579.00 frames. ], tot_loss[loss=0.218, simple_loss=0.2958, pruned_loss=0.07011, over 4281754.85 frames. ], batch size: 415, lr: 3.71e-03, grad_scale: 16.0 2023-06-25 18:13:12,966 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.39 vs. limit=15.0 2023-06-25 18:13:14,803 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.06 vs. limit=12.0 2023-06-25 18:13:40,433 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.735e+02 4.047e+02 5.658e+02 8.715e+02 2.136e+03, threshold=1.132e+03, percent-clipped=16.0 2023-06-25 18:14:43,556 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=1390272.0, ans=0.125 2023-06-25 18:14:49,914 INFO [train.py:996] (3/4) Epoch 8, batch 18250, loss[loss=0.1906, simple_loss=0.2628, pruned_loss=0.05922, over 21638.00 frames. ], tot_loss[loss=0.212, simple_loss=0.2886, pruned_loss=0.06767, over 4285418.97 frames. ], batch size: 263, lr: 3.70e-03, grad_scale: 16.0 2023-06-25 18:15:10,232 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.06 vs. 
limit=22.5 2023-06-25 18:15:33,157 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=1390392.0, ans=0.2 2023-06-25 18:15:43,297 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=1390392.0, ans=0.2 2023-06-25 18:15:55,630 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=1390452.0, ans=0.0 2023-06-25 18:15:55,643 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=1390452.0, ans=0.2 2023-06-25 18:16:23,530 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1390512.0, ans=0.0 2023-06-25 18:16:26,074 INFO [train.py:996] (3/4) Epoch 8, batch 18300, loss[loss=0.1636, simple_loss=0.2394, pruned_loss=0.04387, over 21823.00 frames. ], tot_loss[loss=0.2129, simple_loss=0.2891, pruned_loss=0.06829, over 4284637.28 frames. ], batch size: 102, lr: 3.70e-03, grad_scale: 16.0 2023-06-25 18:17:12,342 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.529e+02 4.046e+02 5.831e+02 1.006e+03 2.196e+03, threshold=1.166e+03, percent-clipped=19.0 2023-06-25 18:17:32,297 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.37 vs. limit=15.0 2023-06-25 18:17:33,563 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1390752.0, ans=0.125 2023-06-25 18:17:45,606 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=1390752.0, ans=0.2 2023-06-25 18:18:12,346 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.66 vs. limit=12.0 2023-06-25 18:18:12,752 INFO [train.py:996] (3/4) Epoch 8, batch 18350, loss[loss=0.2396, simple_loss=0.3549, pruned_loss=0.06219, over 19872.00 frames. ], tot_loss[loss=0.214, simple_loss=0.292, pruned_loss=0.06795, over 4248925.96 frames. ], batch size: 702, lr: 3.70e-03, grad_scale: 16.0 2023-06-25 18:18:50,863 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1390932.0, ans=0.125 2023-06-25 18:18:58,201 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.93 vs. limit=15.0 2023-06-25 18:19:05,819 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=1390992.0, ans=0.025 2023-06-25 18:19:48,534 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1391112.0, ans=0.125 2023-06-25 18:19:56,468 INFO [train.py:996] (3/4) Epoch 8, batch 18400, loss[loss=0.1613, simple_loss=0.236, pruned_loss=0.04332, over 16187.00 frames. ], tot_loss[loss=0.2117, simple_loss=0.2898, pruned_loss=0.0668, over 4236090.11 frames. 
], batch size: 60, lr: 3.70e-03, grad_scale: 32.0 2023-06-25 18:20:25,128 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=1391232.0, ans=0.0 2023-06-25 18:20:38,558 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.600e+02 3.733e+02 5.113e+02 7.460e+02 1.718e+03, threshold=1.023e+03, percent-clipped=6.0 2023-06-25 18:20:41,317 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1391292.0, ans=0.125 2023-06-25 18:21:24,767 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1391412.0, ans=0.0 2023-06-25 18:21:30,458 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=6.33 vs. limit=15.0 2023-06-25 18:21:45,616 INFO [train.py:996] (3/4) Epoch 8, batch 18450, loss[loss=0.2046, simple_loss=0.2773, pruned_loss=0.06598, over 21849.00 frames. ], tot_loss[loss=0.2072, simple_loss=0.2872, pruned_loss=0.06353, over 4246737.99 frames. ], batch size: 107, lr: 3.70e-03, grad_scale: 32.0 2023-06-25 18:21:59,918 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten.whitening_limit, batch_count=1391472.0, ans=15.0 2023-06-25 18:22:02,634 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-25 18:22:31,652 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1391592.0, ans=0.1 2023-06-25 18:22:36,190 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.33 vs. limit=6.0 2023-06-25 18:22:57,502 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1391652.0, ans=0.125 2023-06-25 18:23:09,706 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1391712.0, ans=0.0 2023-06-25 18:23:26,117 INFO [train.py:996] (3/4) Epoch 8, batch 18500, loss[loss=0.1666, simple_loss=0.2365, pruned_loss=0.04833, over 21427.00 frames. ], tot_loss[loss=0.2044, simple_loss=0.2828, pruned_loss=0.06294, over 4257250.00 frames. 
], batch size: 212, lr: 3.70e-03, grad_scale: 32.0 2023-06-25 18:23:55,431 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=1391832.0, ans=0.125 2023-06-25 18:24:07,378 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.590e+02 3.343e+02 4.214e+02 5.911e+02 1.246e+03, threshold=8.429e+02, percent-clipped=4.0 2023-06-25 18:24:31,866 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=1391952.0, ans=0.125 2023-06-25 18:24:31,890 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1391952.0, ans=0.125 2023-06-25 18:24:33,881 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=1391952.0, ans=0.0 2023-06-25 18:24:44,260 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1391952.0, ans=0.125 2023-06-25 18:25:09,771 INFO [train.py:996] (3/4) Epoch 8, batch 18550, loss[loss=0.2241, simple_loss=0.2948, pruned_loss=0.07672, over 21933.00 frames. ], tot_loss[loss=0.2026, simple_loss=0.2801, pruned_loss=0.06254, over 4259866.00 frames. ], batch size: 103, lr: 3.70e-03, grad_scale: 32.0 2023-06-25 18:25:29,634 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.75 vs. limit=22.5 2023-06-25 18:25:44,862 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1392132.0, ans=0.0 2023-06-25 18:25:49,130 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.56 vs. limit=6.0 2023-06-25 18:26:28,804 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=1392252.0, ans=0.125 2023-06-25 18:26:35,766 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=1392312.0, ans=0.0 2023-06-25 18:26:39,424 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.20 vs. limit=15.0 2023-06-25 18:27:04,656 INFO [train.py:996] (3/4) Epoch 8, batch 18600, loss[loss=0.2761, simple_loss=0.3537, pruned_loss=0.09928, over 21520.00 frames. ], tot_loss[loss=0.2027, simple_loss=0.2787, pruned_loss=0.0634, over 4252702.96 frames. ], batch size: 473, lr: 3.70e-03, grad_scale: 16.0 2023-06-25 18:27:36,463 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.831e+02 3.804e+02 5.092e+02 7.468e+02 1.783e+03, threshold=1.018e+03, percent-clipped=18.0 2023-06-25 18:28:05,825 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=1392552.0, ans=0.2 2023-06-25 18:28:10,489 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1392552.0, ans=0.125 2023-06-25 18:28:21,025 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1392612.0, ans=0.125 2023-06-25 18:28:33,916 INFO [train.py:996] (3/4) Epoch 8, batch 18650, loss[loss=0.1981, simple_loss=0.2721, pruned_loss=0.06204, over 21780.00 frames. 
], tot_loss[loss=0.2019, simple_loss=0.2773, pruned_loss=0.06323, over 4248753.46 frames. ], batch size: 102, lr: 3.70e-03, grad_scale: 16.0 2023-06-25 18:29:06,344 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1392732.0, ans=0.125 2023-06-25 18:29:37,733 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1392852.0, ans=0.0 2023-06-25 18:30:03,695 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.95 vs. limit=15.0 2023-06-25 18:30:11,650 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 18:30:15,056 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1392972.0, ans=0.125 2023-06-25 18:30:16,047 INFO [train.py:996] (3/4) Epoch 8, batch 18700, loss[loss=0.2444, simple_loss=0.2946, pruned_loss=0.09716, over 21528.00 frames. ], tot_loss[loss=0.2022, simple_loss=0.2753, pruned_loss=0.06459, over 4256605.35 frames. ], batch size: 471, lr: 3.70e-03, grad_scale: 16.0 2023-06-25 18:30:50,976 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=1393032.0, ans=0.0 2023-06-25 18:31:04,138 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.953e+02 3.708e+02 4.986e+02 6.996e+02 1.849e+03, threshold=9.973e+02, percent-clipped=6.0 2023-06-25 18:31:28,171 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.79 vs. limit=15.0 2023-06-25 18:31:45,310 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1393212.0, ans=0.1 2023-06-25 18:32:02,713 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.73 vs. limit=15.0 2023-06-25 18:32:03,260 INFO [train.py:996] (3/4) Epoch 8, batch 18750, loss[loss=0.2388, simple_loss=0.3138, pruned_loss=0.08196, over 21797.00 frames. ], tot_loss[loss=0.2073, simple_loss=0.2795, pruned_loss=0.0675, over 4250873.78 frames. ], batch size: 124, lr: 3.70e-03, grad_scale: 16.0 2023-06-25 18:32:03,952 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=3.193e-03 2023-06-25 18:32:06,143 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=7.54 vs. limit=15.0 2023-06-25 18:32:08,968 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1393272.0, ans=0.1 2023-06-25 18:32:33,896 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1393332.0, ans=0.125 2023-06-25 18:33:12,948 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.49 vs. 
limit=15.0 2023-06-25 18:33:14,662 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=1393452.0, ans=0.0 2023-06-25 18:33:40,385 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1393512.0, ans=0.125 2023-06-25 18:33:43,764 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=1393512.0, ans=0.125 2023-06-25 18:33:48,491 INFO [train.py:996] (3/4) Epoch 8, batch 18800, loss[loss=0.2283, simple_loss=0.3187, pruned_loss=0.06894, over 21694.00 frames. ], tot_loss[loss=0.2114, simple_loss=0.2854, pruned_loss=0.06869, over 4239660.94 frames. ], batch size: 441, lr: 3.70e-03, grad_scale: 32.0 2023-06-25 18:34:22,331 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-25 18:34:31,897 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.846e+02 4.247e+02 5.340e+02 7.897e+02 1.499e+03, threshold=1.068e+03, percent-clipped=10.0 2023-06-25 18:34:39,136 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=1393692.0, ans=0.5 2023-06-25 18:35:03,839 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1393752.0, ans=0.125 2023-06-25 18:35:19,779 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1393812.0, ans=0.125 2023-06-25 18:35:31,316 INFO [train.py:996] (3/4) Epoch 8, batch 18850, loss[loss=0.1855, simple_loss=0.2451, pruned_loss=0.06297, over 21149.00 frames. ], tot_loss[loss=0.2047, simple_loss=0.2805, pruned_loss=0.06443, over 4247908.38 frames. ], batch size: 608, lr: 3.70e-03, grad_scale: 32.0 2023-06-25 18:36:00,839 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.96 vs. limit=12.0 2023-06-25 18:36:18,899 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=1393932.0, ans=0.125 2023-06-25 18:37:18,485 INFO [train.py:996] (3/4) Epoch 8, batch 18900, loss[loss=0.2168, simple_loss=0.2681, pruned_loss=0.08271, over 21438.00 frames. ], tot_loss[loss=0.2023, simple_loss=0.2768, pruned_loss=0.0639, over 4248961.48 frames. ], batch size: 476, lr: 3.70e-03, grad_scale: 16.0 2023-06-25 18:37:45,235 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.25 vs. limit=15.0 2023-06-25 18:37:56,040 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1394232.0, ans=0.0 2023-06-25 18:37:57,544 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-25 18:37:57,624 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1394232.0, ans=0.125 2023-06-25 18:38:05,155 INFO [scaling.py:962] (3/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=3.99 vs. 
limit=5.0 2023-06-25 18:38:09,194 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.328e+02 3.589e+02 4.833e+02 6.205e+02 1.384e+03, threshold=9.667e+02, percent-clipped=4.0 2023-06-25 18:38:25,643 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1394352.0, ans=0.1 2023-06-25 18:39:07,632 INFO [train.py:996] (3/4) Epoch 8, batch 18950, loss[loss=0.2076, simple_loss=0.2878, pruned_loss=0.0637, over 21344.00 frames. ], tot_loss[loss=0.2064, simple_loss=0.2803, pruned_loss=0.06627, over 4262751.92 frames. ], batch size: 159, lr: 3.70e-03, grad_scale: 16.0 2023-06-25 18:39:27,765 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=8.42 vs. limit=15.0 2023-06-25 18:40:45,611 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=1394712.0, ans=0.0 2023-06-25 18:41:08,060 INFO [train.py:996] (3/4) Epoch 8, batch 19000, loss[loss=0.2421, simple_loss=0.3211, pruned_loss=0.08157, over 21921.00 frames. ], tot_loss[loss=0.2126, simple_loss=0.2897, pruned_loss=0.06772, over 4270011.31 frames. ], batch size: 372, lr: 3.70e-03, grad_scale: 16.0 2023-06-25 18:41:28,917 INFO [scaling.py:962] (3/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=5.40 vs. limit=8.0 2023-06-25 18:41:47,832 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.853e+02 4.722e+02 6.033e+02 9.741e+02 2.203e+03, threshold=1.207e+03, percent-clipped=24.0 2023-06-25 18:42:28,837 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1395012.0, ans=0.0 2023-06-25 18:42:56,842 INFO [train.py:996] (3/4) Epoch 8, batch 19050, loss[loss=0.2163, simple_loss=0.2846, pruned_loss=0.07402, over 21866.00 frames. ], tot_loss[loss=0.2185, simple_loss=0.2944, pruned_loss=0.07127, over 4277579.56 frames. ], batch size: 371, lr: 3.70e-03, grad_scale: 16.0 2023-06-25 18:43:12,724 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.max_abs, batch_count=1395132.0, ans=10.0 2023-06-25 18:43:50,747 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.59 vs. limit=15.0 2023-06-25 18:43:57,161 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1395252.0, ans=0.125 2023-06-25 18:44:21,559 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=1395312.0, ans=0.07 2023-06-25 18:44:44,094 INFO [train.py:996] (3/4) Epoch 8, batch 19100, loss[loss=0.1849, simple_loss=0.2552, pruned_loss=0.05735, over 21789.00 frames. ], tot_loss[loss=0.2173, simple_loss=0.2922, pruned_loss=0.07115, over 4282953.51 frames. 
], batch size: 118, lr: 3.70e-03, grad_scale: 16.0 2023-06-25 18:45:15,065 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=1395432.0, ans=0.125 2023-06-25 18:45:19,987 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.078e+02 4.021e+02 4.752e+02 6.454e+02 2.086e+03, threshold=9.504e+02, percent-clipped=4.0 2023-06-25 18:45:22,302 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=1395492.0, ans=10.0 2023-06-25 18:45:47,464 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=1395552.0, ans=0.2 2023-06-25 18:45:50,915 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1395552.0, ans=0.125 2023-06-25 18:46:18,180 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=1395612.0, ans=0.125 2023-06-25 18:46:30,724 INFO [train.py:996] (3/4) Epoch 8, batch 19150, loss[loss=0.333, simple_loss=0.4143, pruned_loss=0.1258, over 21497.00 frames. ], tot_loss[loss=0.2173, simple_loss=0.2922, pruned_loss=0.07123, over 4280993.86 frames. ], batch size: 471, lr: 3.70e-03, grad_scale: 16.0 2023-06-25 18:46:31,846 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=6.12 vs. limit=22.5 2023-06-25 18:46:50,039 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=1395732.0, ans=0.2 2023-06-25 18:46:53,417 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=1395732.0, ans=0.125 2023-06-25 18:47:43,380 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1395852.0, ans=0.1 2023-06-25 18:48:02,244 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=1395912.0, ans=0.125 2023-06-25 18:48:21,118 INFO [train.py:996] (3/4) Epoch 8, batch 19200, loss[loss=0.2181, simple_loss=0.3289, pruned_loss=0.05361, over 21654.00 frames. ], tot_loss[loss=0.2234, simple_loss=0.3024, pruned_loss=0.07218, over 4278423.49 frames. ], batch size: 298, lr: 3.70e-03, grad_scale: 32.0 2023-06-25 18:48:47,805 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer_ff3.min_abs, batch_count=1396032.0, ans=0.2 2023-06-25 18:49:00,518 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.955e+02 4.242e+02 5.606e+02 9.141e+02 1.658e+03, threshold=1.121e+03, percent-clipped=22.0 2023-06-25 18:49:23,520 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.31 vs. limit=22.5 2023-06-25 18:49:47,120 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.max_abs, batch_count=1396212.0, ans=10.0 2023-06-25 18:49:48,828 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=1396212.0, ans=0.125 2023-06-25 18:50:01,384 INFO [train.py:996] (3/4) Epoch 8, batch 19250, loss[loss=0.233, simple_loss=0.3232, pruned_loss=0.07135, over 21503.00 frames. ], tot_loss[loss=0.2188, simple_loss=0.3022, pruned_loss=0.06767, over 4271978.45 frames. 
], batch size: 507, lr: 3.70e-03, grad_scale: 32.0 2023-06-25 18:51:44,988 INFO [train.py:996] (3/4) Epoch 8, batch 19300, loss[loss=0.1922, simple_loss=0.2733, pruned_loss=0.05553, over 21784.00 frames. ], tot_loss[loss=0.2169, simple_loss=0.2996, pruned_loss=0.06711, over 4280602.52 frames. ], batch size: 298, lr: 3.70e-03, grad_scale: 32.0 2023-06-25 18:52:28,395 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.307e+02 3.708e+02 5.763e+02 8.363e+02 1.771e+03, threshold=1.153e+03, percent-clipped=11.0 2023-06-25 18:53:37,185 INFO [train.py:996] (3/4) Epoch 8, batch 19350, loss[loss=0.2048, simple_loss=0.2716, pruned_loss=0.06902, over 21166.00 frames. ], tot_loss[loss=0.2112, simple_loss=0.295, pruned_loss=0.06374, over 4278889.98 frames. ], batch size: 608, lr: 3.70e-03, grad_scale: 32.0 2023-06-25 18:53:47,197 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=1396872.0, ans=0.2 2023-06-25 18:54:16,482 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=1396992.0, ans=0.2 2023-06-25 18:54:26,998 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.27 vs. limit=15.0 2023-06-25 18:54:28,464 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1396992.0, ans=0.125 2023-06-25 18:55:23,952 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1397172.0, ans=0.125 2023-06-25 18:55:25,372 INFO [train.py:996] (3/4) Epoch 8, batch 19400, loss[loss=0.2081, simple_loss=0.2928, pruned_loss=0.06168, over 21819.00 frames. ], tot_loss[loss=0.2084, simple_loss=0.2916, pruned_loss=0.06265, over 4285230.12 frames. ], batch size: 333, lr: 3.70e-03, grad_scale: 16.0 2023-06-25 18:55:59,891 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2.whitening_limit, batch_count=1397232.0, ans=15.0 2023-06-25 18:56:07,068 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.659e+02 3.804e+02 4.878e+02 6.968e+02 1.951e+03, threshold=9.756e+02, percent-clipped=7.0 2023-06-25 18:56:07,860 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1397292.0, ans=0.125 2023-06-25 18:56:48,659 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.07 vs. limit=15.0 2023-06-25 18:56:51,412 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1397352.0, ans=0.125 2023-06-25 18:56:56,784 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-25 18:57:02,026 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=1397412.0, ans=0.0 2023-06-25 18:57:13,642 INFO [train.py:996] (3/4) Epoch 8, batch 19450, loss[loss=0.194, simple_loss=0.2743, pruned_loss=0.0569, over 20087.00 frames. ], tot_loss[loss=0.2094, simple_loss=0.2896, pruned_loss=0.06463, over 4290319.89 frames. 
], batch size: 702, lr: 3.70e-03, grad_scale: 16.0 2023-06-25 18:57:14,438 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=1397472.0, ans=0.0 2023-06-25 18:57:22,882 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=1397472.0, ans=0.0 2023-06-25 18:58:03,478 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1397592.0, ans=0.0 2023-06-25 18:58:22,937 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=1397592.0, ans=0.125 2023-06-25 18:58:28,422 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=1397652.0, ans=0.2 2023-06-25 18:58:38,936 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.13 vs. limit=12.0 2023-06-25 18:58:40,148 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=1397652.0, ans=0.0 2023-06-25 18:58:42,205 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-25 18:58:54,726 INFO [scaling.py:962] (3/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=4.31 vs. limit=5.0 2023-06-25 18:58:57,506 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=1397712.0, ans=0.05 2023-06-25 18:59:01,892 INFO [train.py:996] (3/4) Epoch 8, batch 19500, loss[loss=0.1833, simple_loss=0.2333, pruned_loss=0.06664, over 20843.00 frames. ], tot_loss[loss=0.2089, simple_loss=0.2859, pruned_loss=0.06594, over 4287310.51 frames. ], batch size: 608, lr: 3.69e-03, grad_scale: 16.0 2023-06-25 18:59:38,461 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1397832.0, ans=0.125 2023-06-25 18:59:47,841 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.224e+02 4.180e+02 5.667e+02 7.986e+02 1.317e+03, threshold=1.133e+03, percent-clipped=13.0 2023-06-25 19:00:19,553 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1397952.0, ans=0.125 2023-06-25 19:00:40,036 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=1398012.0, ans=0.0 2023-06-25 19:00:42,902 INFO [train.py:996] (3/4) Epoch 8, batch 19550, loss[loss=0.1937, simple_loss=0.278, pruned_loss=0.05473, over 21141.00 frames. ], tot_loss[loss=0.2066, simple_loss=0.283, pruned_loss=0.06514, over 4286168.66 frames. 
], batch size: 159, lr: 3.69e-03, grad_scale: 16.0 2023-06-25 19:01:05,752 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=1398132.0, ans=0.125 2023-06-25 19:01:18,991 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1398132.0, ans=0.0 2023-06-25 19:01:58,034 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1398252.0, ans=0.125 2023-06-25 19:01:59,712 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1398252.0, ans=0.125 2023-06-25 19:02:23,098 INFO [train.py:996] (3/4) Epoch 8, batch 19600, loss[loss=0.2132, simple_loss=0.304, pruned_loss=0.06117, over 19811.00 frames. ], tot_loss[loss=0.208, simple_loss=0.284, pruned_loss=0.06604, over 4278021.90 frames. ], batch size: 704, lr: 3.69e-03, grad_scale: 32.0 2023-06-25 19:02:46,280 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=1398432.0, ans=0.125 2023-06-25 19:02:51,082 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=1398432.0, ans=0.125 2023-06-25 19:02:56,359 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 19:03:10,845 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.701e+02 4.318e+02 6.046e+02 9.838e+02 1.787e+03, threshold=1.209e+03, percent-clipped=19.0 2023-06-25 19:03:20,369 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=1398492.0, ans=0.2 2023-06-25 19:03:43,202 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=1398552.0, ans=0.2 2023-06-25 19:03:50,322 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=1398552.0, ans=0.2 2023-06-25 19:04:10,731 INFO [train.py:996] (3/4) Epoch 8, batch 19650, loss[loss=0.2225, simple_loss=0.289, pruned_loss=0.07802, over 20708.00 frames. ], tot_loss[loss=0.2135, simple_loss=0.2886, pruned_loss=0.06916, over 4276772.09 frames. ], batch size: 607, lr: 3.69e-03, grad_scale: 16.0 2023-06-25 19:04:23,230 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1398672.0, ans=0.0 2023-06-25 19:04:25,169 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1398672.0, ans=0.125 2023-06-25 19:04:41,743 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1398732.0, ans=0.125 2023-06-25 19:06:07,307 INFO [train.py:996] (3/4) Epoch 8, batch 19700, loss[loss=0.1969, simple_loss=0.3124, pruned_loss=0.04075, over 20790.00 frames. ], tot_loss[loss=0.217, simple_loss=0.2938, pruned_loss=0.07014, over 4273618.82 frames. 
], batch size: 608, lr: 3.69e-03, grad_scale: 16.0 2023-06-25 19:06:37,903 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1399032.0, ans=0.125 2023-06-25 19:06:55,111 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1399032.0, ans=0.1 2023-06-25 19:07:03,155 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.726e+02 4.245e+02 5.228e+02 6.853e+02 1.147e+03, threshold=1.046e+03, percent-clipped=0.0 2023-06-25 19:07:17,806 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=1399152.0, ans=0.07 2023-06-25 19:07:29,594 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=1399152.0, ans=0.95 2023-06-25 19:07:46,979 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=1399212.0, ans=0.2 2023-06-25 19:08:01,685 INFO [train.py:996] (3/4) Epoch 8, batch 19750, loss[loss=0.2294, simple_loss=0.3198, pruned_loss=0.06946, over 21643.00 frames. ], tot_loss[loss=0.2224, simple_loss=0.302, pruned_loss=0.07138, over 4258791.61 frames. ], batch size: 263, lr: 3.69e-03, grad_scale: 16.0 2023-06-25 19:08:44,298 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=1399332.0, ans=0.0 2023-06-25 19:09:21,218 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.98 vs. limit=15.0 2023-06-25 19:09:28,077 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.04 vs. limit=15.0 2023-06-25 19:09:44,774 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=1399512.0, ans=0.125 2023-06-25 19:09:54,487 INFO [train.py:996] (3/4) Epoch 8, batch 19800, loss[loss=0.2395, simple_loss=0.3051, pruned_loss=0.08696, over 21891.00 frames. ], tot_loss[loss=0.2222, simple_loss=0.3012, pruned_loss=0.0716, over 4266372.41 frames. ], batch size: 107, lr: 3.69e-03, grad_scale: 16.0 2023-06-25 19:10:21,872 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.45 vs. limit=6.0 2023-06-25 19:10:33,819 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1399692.0, ans=0.1 2023-06-25 19:10:35,526 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1399692.0, ans=0.125 2023-06-25 19:10:38,208 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.249e+02 4.512e+02 5.932e+02 8.767e+02 2.271e+03, threshold=1.186e+03, percent-clipped=19.0 2023-06-25 19:10:42,582 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1399692.0, ans=0.125 2023-06-25 19:10:44,778 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.10 vs. 
limit=22.5 2023-06-25 19:10:56,809 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1399752.0, ans=0.1 2023-06-25 19:11:31,561 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1399812.0, ans=0.0 2023-06-25 19:11:42,794 INFO [train.py:996] (3/4) Epoch 8, batch 19850, loss[loss=0.2402, simple_loss=0.3174, pruned_loss=0.08152, over 21466.00 frames. ], tot_loss[loss=0.2149, simple_loss=0.2948, pruned_loss=0.0675, over 4267056.74 frames. ], batch size: 507, lr: 3.69e-03, grad_scale: 16.0 2023-06-25 19:12:38,448 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1399992.0, ans=0.125 2023-06-25 19:12:40,381 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=1399992.0, ans=0.125 2023-06-25 19:13:14,178 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=1400112.0, ans=0.125 2023-06-25 19:13:15,680 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1400112.0, ans=0.125 2023-06-25 19:13:28,589 INFO [train.py:996] (3/4) Epoch 8, batch 19900, loss[loss=0.1874, simple_loss=0.265, pruned_loss=0.05493, over 21753.00 frames. ], tot_loss[loss=0.2125, simple_loss=0.2949, pruned_loss=0.06512, over 4271004.04 frames. ], batch size: 351, lr: 3.69e-03, grad_scale: 16.0 2023-06-25 19:13:29,160 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=1400172.0, ans=0.0 2023-06-25 19:13:40,626 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=6.88 vs. limit=15.0 2023-06-25 19:14:17,374 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.623e+02 3.554e+02 4.496e+02 7.903e+02 1.499e+03, threshold=8.992e+02, percent-clipped=4.0 2023-06-25 19:14:21,953 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=6.26 vs. limit=15.0 2023-06-25 19:14:28,802 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1400292.0, ans=0.0 2023-06-25 19:15:05,998 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=1400412.0, ans=10.0 2023-06-25 19:15:17,752 INFO [train.py:996] (3/4) Epoch 8, batch 19950, loss[loss=0.1882, simple_loss=0.2564, pruned_loss=0.05997, over 21662.00 frames. ], tot_loss[loss=0.2096, simple_loss=0.2886, pruned_loss=0.06524, over 4273298.65 frames. ], batch size: 282, lr: 3.69e-03, grad_scale: 16.0 2023-06-25 19:15:51,108 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=6.46 vs. 
limit=15.0 2023-06-25 19:16:11,123 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=1400592.0, ans=0.07 2023-06-25 19:16:55,217 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=1400712.0, ans=0.0 2023-06-25 19:17:05,432 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1400772.0, ans=0.125 2023-06-25 19:17:12,754 INFO [train.py:996] (3/4) Epoch 8, batch 20000, loss[loss=0.23, simple_loss=0.3113, pruned_loss=0.07434, over 21794.00 frames. ], tot_loss[loss=0.2109, simple_loss=0.2901, pruned_loss=0.06589, over 4267818.86 frames. ], batch size: 112, lr: 3.69e-03, grad_scale: 32.0 2023-06-25 19:17:16,782 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1400772.0, ans=0.1 2023-06-25 19:17:25,186 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1400772.0, ans=0.125 2023-06-25 19:17:28,692 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=1400832.0, ans=0.0 2023-06-25 19:17:55,681 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.991e+02 3.942e+02 5.343e+02 7.186e+02 1.508e+03, threshold=1.069e+03, percent-clipped=12.0 2023-06-25 19:18:27,016 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=1400952.0, ans=0.0 2023-06-25 19:18:58,665 INFO [train.py:996] (3/4) Epoch 8, batch 20050, loss[loss=0.2341, simple_loss=0.3018, pruned_loss=0.08324, over 21830.00 frames. ], tot_loss[loss=0.2133, simple_loss=0.2909, pruned_loss=0.06779, over 4278002.80 frames. ], batch size: 107, lr: 3.69e-03, grad_scale: 16.0 2023-06-25 19:19:03,314 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2.whitening_limit, batch_count=1401072.0, ans=15.0 2023-06-25 19:19:06,309 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=1401072.0, ans=0.2 2023-06-25 19:19:16,659 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1401132.0, ans=0.125 2023-06-25 19:19:47,141 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 19:20:48,060 INFO [train.py:996] (3/4) Epoch 8, batch 20100, loss[loss=0.1947, simple_loss=0.259, pruned_loss=0.06519, over 17216.00 frames. ], tot_loss[loss=0.2155, simple_loss=0.2923, pruned_loss=0.06928, over 4283949.84 frames. 
], batch size: 61, lr: 3.69e-03, grad_scale: 16.0 2023-06-25 19:21:19,965 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=1401432.0, ans=0.125 2023-06-25 19:21:29,492 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten.whitening_limit, batch_count=1401492.0, ans=15.0 2023-06-25 19:21:33,499 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.930e+02 3.809e+02 4.961e+02 6.304e+02 1.570e+03, threshold=9.921e+02, percent-clipped=3.0 2023-06-25 19:21:34,196 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1401492.0, ans=0.1 2023-06-25 19:22:38,489 INFO [train.py:996] (3/4) Epoch 8, batch 20150, loss[loss=0.2029, simple_loss=0.2524, pruned_loss=0.07676, over 20040.00 frames. ], tot_loss[loss=0.2239, simple_loss=0.3024, pruned_loss=0.07275, over 4281784.69 frames. ], batch size: 704, lr: 3.69e-03, grad_scale: 16.0 2023-06-25 19:23:04,358 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=1401732.0, ans=0.2 2023-06-25 19:24:01,317 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.77 vs. limit=15.0 2023-06-25 19:24:02,797 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=1401852.0, ans=0.04949747468305833 2023-06-25 19:24:17,098 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1401912.0, ans=0.0 2023-06-25 19:24:35,465 INFO [train.py:996] (3/4) Epoch 8, batch 20200, loss[loss=0.2473, simple_loss=0.3443, pruned_loss=0.07513, over 21833.00 frames. ], tot_loss[loss=0.23, simple_loss=0.3085, pruned_loss=0.07577, over 4283035.20 frames. ], batch size: 316, lr: 3.69e-03, grad_scale: 16.0 2023-06-25 19:24:53,083 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1401972.0, ans=0.125 2023-06-25 19:24:53,178 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 19:25:12,914 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.65 vs. limit=15.0 2023-06-25 19:25:16,744 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=8.10 vs. limit=15.0 2023-06-25 19:25:25,533 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.344e+02 4.250e+02 5.853e+02 8.923e+02 1.822e+03, threshold=1.171e+03, percent-clipped=17.0 2023-06-25 19:25:51,058 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1402152.0, ans=0.0 2023-06-25 19:26:08,189 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=1402212.0, ans=0.2 2023-06-25 19:26:10,556 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.52 vs. limit=15.0 2023-06-25 19:26:23,002 INFO [train.py:996] (3/4) Epoch 8, batch 20250, loss[loss=0.2535, simple_loss=0.3338, pruned_loss=0.08667, over 21588.00 frames. 
], tot_loss[loss=0.2301, simple_loss=0.3102, pruned_loss=0.07503, over 4284283.32 frames. ], batch size: 471, lr: 3.69e-03, grad_scale: 16.0 2023-06-25 19:27:00,510 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1402332.0, ans=0.0 2023-06-25 19:27:05,524 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1402332.0, ans=0.125 2023-06-25 19:27:06,098 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.69 vs. limit=22.5 2023-06-25 19:27:07,438 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1402392.0, ans=0.0 2023-06-25 19:27:22,823 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=1402392.0, ans=0.2 2023-06-25 19:27:23,404 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.98 vs. limit=10.0 2023-06-25 19:27:37,170 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.91 vs. limit=22.5 2023-06-25 19:28:00,995 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=1402512.0, ans=0.5 2023-06-25 19:28:15,776 INFO [train.py:996] (3/4) Epoch 8, batch 20300, loss[loss=0.2102, simple_loss=0.3004, pruned_loss=0.06002, over 21703.00 frames. ], tot_loss[loss=0.2262, simple_loss=0.3081, pruned_loss=0.0722, over 4276623.24 frames. ], batch size: 298, lr: 3.69e-03, grad_scale: 16.0 2023-06-25 19:28:45,993 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1402632.0, ans=0.125 2023-06-25 19:28:58,611 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.664e+02 3.704e+02 5.083e+02 7.003e+02 2.093e+03, threshold=1.017e+03, percent-clipped=9.0 2023-06-25 19:29:22,915 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.32 vs. limit=12.0 2023-06-25 19:29:30,645 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1402752.0, ans=0.125 2023-06-25 19:29:56,321 INFO [train.py:996] (3/4) Epoch 8, batch 20350, loss[loss=0.2292, simple_loss=0.3054, pruned_loss=0.07654, over 21873.00 frames. ], tot_loss[loss=0.2262, simple_loss=0.3079, pruned_loss=0.07232, over 4264140.94 frames. ], batch size: 124, lr: 3.69e-03, grad_scale: 16.0 2023-06-25 19:30:13,021 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1402872.0, ans=0.0 2023-06-25 19:30:51,252 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.11 vs. 
limit=12.0 2023-06-25 19:31:09,956 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1403052.0, ans=0.125 2023-06-25 19:31:43,171 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1403172.0, ans=0.125 2023-06-25 19:31:44,534 INFO [train.py:996] (3/4) Epoch 8, batch 20400, loss[loss=0.2766, simple_loss=0.3545, pruned_loss=0.09932, over 21711.00 frames. ], tot_loss[loss=0.2296, simple_loss=0.3099, pruned_loss=0.07464, over 4246049.83 frames. ], batch size: 414, lr: 3.69e-03, grad_scale: 32.0 2023-06-25 19:31:56,193 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=13.62 vs. limit=15.0 2023-06-25 19:32:05,420 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1403172.0, ans=0.1 2023-06-25 19:32:21,156 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1403232.0, ans=0.0 2023-06-25 19:32:33,894 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.980e+02 4.155e+02 6.028e+02 7.732e+02 1.561e+03, threshold=1.206e+03, percent-clipped=8.0 2023-06-25 19:32:46,876 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1403292.0, ans=0.0 2023-06-25 19:32:50,442 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=1403352.0, ans=0.0 2023-06-25 19:33:18,854 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.35 vs. limit=15.0 2023-06-25 19:33:31,066 INFO [train.py:996] (3/4) Epoch 8, batch 20450, loss[loss=0.2035, simple_loss=0.2612, pruned_loss=0.07291, over 19961.00 frames. ], tot_loss[loss=0.2314, simple_loss=0.3102, pruned_loss=0.0763, over 4229994.37 frames. ], batch size: 704, lr: 3.69e-03, grad_scale: 32.0 2023-06-25 19:33:33,296 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=1403472.0, ans=0.0 2023-06-25 19:34:01,489 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1403532.0, ans=0.1 2023-06-25 19:34:03,168 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1403532.0, ans=0.125 2023-06-25 19:34:13,642 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=1403592.0, ans=0.125 2023-06-25 19:34:16,603 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=1403592.0, ans=0.04949747468305833 2023-06-25 19:35:16,948 INFO [train.py:996] (3/4) Epoch 8, batch 20500, loss[loss=0.2175, simple_loss=0.2806, pruned_loss=0.07726, over 21300.00 frames. ], tot_loss[loss=0.2286, simple_loss=0.305, pruned_loss=0.07617, over 4232897.61 frames. 
], batch size: 159, lr: 3.69e-03, grad_scale: 16.0 2023-06-25 19:35:54,549 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1403832.0, ans=0.125 2023-06-25 19:36:07,744 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.844e+02 4.072e+02 6.125e+02 8.287e+02 1.348e+03, threshold=1.225e+03, percent-clipped=6.0 2023-06-25 19:36:57,923 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=1404012.0, ans=0.2 2023-06-25 19:36:58,002 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1404012.0, ans=0.1 2023-06-25 19:37:09,444 INFO [train.py:996] (3/4) Epoch 8, batch 20550, loss[loss=0.1917, simple_loss=0.2674, pruned_loss=0.05801, over 21216.00 frames. ], tot_loss[loss=0.2239, simple_loss=0.2984, pruned_loss=0.0747, over 4239556.30 frames. ], batch size: 143, lr: 3.69e-03, grad_scale: 16.0 2023-06-25 19:37:46,652 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1404132.0, ans=0.125 2023-06-25 19:37:48,365 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=1404192.0, ans=0.125 2023-06-25 19:38:15,180 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=1404252.0, ans=0.2 2023-06-25 19:38:52,322 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1404312.0, ans=0.125 2023-06-25 19:38:56,989 INFO [train.py:996] (3/4) Epoch 8, batch 20600, loss[loss=0.232, simple_loss=0.2971, pruned_loss=0.08341, over 21461.00 frames. ], tot_loss[loss=0.2247, simple_loss=0.3026, pruned_loss=0.07346, over 4244021.67 frames. ], batch size: 194, lr: 3.69e-03, grad_scale: 16.0 2023-06-25 19:39:19,355 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1404432.0, ans=0.0 2023-06-25 19:39:42,064 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.887e+02 4.920e+02 7.013e+02 1.215e+03 1.791e+03, threshold=1.403e+03, percent-clipped=24.0 2023-06-25 19:40:42,117 INFO [train.py:996] (3/4) Epoch 8, batch 20650, loss[loss=0.1988, simple_loss=0.2631, pruned_loss=0.06719, over 21732.00 frames. ], tot_loss[loss=0.2232, simple_loss=0.2986, pruned_loss=0.07389, over 4255782.96 frames. ], batch size: 282, lr: 3.69e-03, grad_scale: 16.0 2023-06-25 19:40:59,615 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1404672.0, ans=0.125 2023-06-25 19:41:43,660 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=1404852.0, ans=0.0 2023-06-25 19:41:55,092 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1404852.0, ans=0.0 2023-06-25 19:41:57,107 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.61 vs. 
limit=15.0 2023-06-25 19:42:14,436 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1404912.0, ans=0.1 2023-06-25 19:42:31,275 INFO [train.py:996] (3/4) Epoch 8, batch 20700, loss[loss=0.1967, simple_loss=0.2718, pruned_loss=0.06082, over 21766.00 frames. ], tot_loss[loss=0.2151, simple_loss=0.2895, pruned_loss=0.0703, over 4259429.50 frames. ], batch size: 282, lr: 3.69e-03, grad_scale: 8.0 2023-06-25 19:43:27,033 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.481e+02 3.647e+02 4.600e+02 6.617e+02 1.302e+03, threshold=9.199e+02, percent-clipped=0.0 2023-06-25 19:44:26,708 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1405272.0, ans=0.125 2023-06-25 19:44:27,717 INFO [train.py:996] (3/4) Epoch 8, batch 20750, loss[loss=0.1597, simple_loss=0.2182, pruned_loss=0.05063, over 18286.00 frames. ], tot_loss[loss=0.2166, simple_loss=0.2928, pruned_loss=0.07023, over 4258090.85 frames. ], batch size: 70, lr: 3.69e-03, grad_scale: 8.0 2023-06-25 19:44:32,236 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=1405272.0, ans=0.125 2023-06-25 19:45:06,424 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.86 vs. limit=15.0 2023-06-25 19:45:21,600 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1405392.0, ans=0.125 2023-06-25 19:45:41,977 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1405452.0, ans=0.1 2023-06-25 19:45:42,481 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=9.39 vs. limit=15.0 2023-06-25 19:46:04,985 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.76 vs. limit=15.0 2023-06-25 19:46:16,219 INFO [train.py:996] (3/4) Epoch 8, batch 20800, loss[loss=0.2216, simple_loss=0.2842, pruned_loss=0.07947, over 21526.00 frames. ], tot_loss[loss=0.2189, simple_loss=0.2959, pruned_loss=0.07101, over 4262092.77 frames. ], batch size: 414, lr: 3.68e-03, grad_scale: 16.0 2023-06-25 19:46:30,568 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=1405572.0, ans=0.0 2023-06-25 19:46:39,738 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 19:47:10,349 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.060e+02 4.318e+02 7.506e+02 1.059e+03 2.434e+03, threshold=1.501e+03, percent-clipped=34.0 2023-06-25 19:47:48,830 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.02 vs. 
limit=15.0 2023-06-25 19:47:53,050 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=1405812.0, ans=0.125 2023-06-25 19:47:56,616 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=1405812.0, ans=0.04949747468305833 2023-06-25 19:48:00,194 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1405812.0, ans=0.1 2023-06-25 19:48:02,815 INFO [train.py:996] (3/4) Epoch 8, batch 20850, loss[loss=0.2238, simple_loss=0.292, pruned_loss=0.0778, over 21714.00 frames. ], tot_loss[loss=0.2139, simple_loss=0.2894, pruned_loss=0.06923, over 4261955.59 frames. ], batch size: 441, lr: 3.68e-03, grad_scale: 16.0 2023-06-25 19:48:40,944 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1405992.0, ans=0.0 2023-06-25 19:48:51,259 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=1405992.0, ans=0.125 2023-06-25 19:49:10,216 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=1406052.0, ans=0.2 2023-06-25 19:49:34,999 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1406112.0, ans=0.125 2023-06-25 19:49:42,330 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.14 vs. limit=15.0 2023-06-25 19:49:47,596 INFO [train.py:996] (3/4) Epoch 8, batch 20900, loss[loss=0.2256, simple_loss=0.3044, pruned_loss=0.07343, over 21845.00 frames. ], tot_loss[loss=0.2149, simple_loss=0.2897, pruned_loss=0.07003, over 4271520.78 frames. ], batch size: 351, lr: 3.68e-03, grad_scale: 16.0 2023-06-25 19:49:57,250 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.24 vs. limit=22.5 2023-06-25 19:50:34,298 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.696e+02 3.719e+02 4.943e+02 7.397e+02 1.417e+03, threshold=9.886e+02, percent-clipped=0.0 2023-06-25 19:51:12,033 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=13.76 vs. limit=22.5 2023-06-25 19:51:23,845 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.20 vs. limit=10.0 2023-06-25 19:51:30,970 INFO [train.py:996] (3/4) Epoch 8, batch 20950, loss[loss=0.1722, simple_loss=0.2531, pruned_loss=0.04567, over 21437.00 frames. ], tot_loss[loss=0.209, simple_loss=0.2848, pruned_loss=0.06662, over 4264475.58 frames. ], batch size: 211, lr: 3.68e-03, grad_scale: 16.0 2023-06-25 19:51:37,333 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.35 vs. limit=15.0 2023-06-25 19:51:44,105 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=5.51 vs. 
limit=12.0 2023-06-25 19:51:53,577 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1406532.0, ans=0.0 2023-06-25 19:51:55,174 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1406532.0, ans=0.1 2023-06-25 19:52:17,571 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1406592.0, ans=0.125 2023-06-25 19:52:21,226 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=1406592.0, ans=0.035 2023-06-25 19:52:41,149 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.13 vs. limit=6.0 2023-06-25 19:52:47,552 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.85 vs. limit=15.0 2023-06-25 19:53:11,618 INFO [train.py:996] (3/4) Epoch 8, batch 21000, loss[loss=0.2347, simple_loss=0.3541, pruned_loss=0.05766, over 19776.00 frames. ], tot_loss[loss=0.2087, simple_loss=0.2842, pruned_loss=0.06658, over 4252782.84 frames. ], batch size: 702, lr: 3.68e-03, grad_scale: 16.0 2023-06-25 19:53:11,619 INFO [train.py:1019] (3/4) Computing validation loss 2023-06-25 19:53:31,260 INFO [train.py:1028] (3/4) Epoch 8, validation: loss=0.2635, simple_loss=0.3595, pruned_loss=0.08373, over 1796401.00 frames. 2023-06-25 19:53:31,261 INFO [train.py:1029] (3/4) Maximum memory allocated so far is 23654MB 2023-06-25 19:54:24,844 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.517e+02 3.578e+02 4.486e+02 7.087e+02 1.717e+03, threshold=8.972e+02, percent-clipped=7.0 2023-06-25 19:54:32,017 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1406952.0, ans=0.0 2023-06-25 19:54:36,751 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1406952.0, ans=0.125 2023-06-25 19:54:46,708 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1406952.0, ans=0.125 2023-06-25 19:55:17,228 INFO [train.py:996] (3/4) Epoch 8, batch 21050, loss[loss=0.2408, simple_loss=0.2816, pruned_loss=0.09997, over 21425.00 frames. ], tot_loss[loss=0.2084, simple_loss=0.2829, pruned_loss=0.06694, over 4254271.33 frames. ], batch size: 509, lr: 3.68e-03, grad_scale: 16.0 2023-06-25 19:56:07,405 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1407192.0, ans=0.125 2023-06-25 19:56:58,639 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 19:57:00,623 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1407312.0, ans=0.125 2023-06-25 19:57:02,569 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1407312.0, ans=0.0 2023-06-25 19:57:05,032 INFO [train.py:996] (3/4) Epoch 8, batch 21100, loss[loss=0.1706, simple_loss=0.2724, pruned_loss=0.03442, over 19818.00 frames. 
], tot_loss[loss=0.2066, simple_loss=0.28, pruned_loss=0.06659, over 4230154.59 frames. ], batch size: 703, lr: 3.68e-03, grad_scale: 16.0 2023-06-25 19:57:43,387 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1407492.0, ans=0.125 2023-06-25 19:57:45,077 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=1407492.0, ans=0.2 2023-06-25 19:57:50,148 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1407492.0, ans=0.125 2023-06-25 19:57:57,762 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.898e+02 4.201e+02 5.635e+02 7.939e+02 1.482e+03, threshold=1.127e+03, percent-clipped=15.0 2023-06-25 19:58:49,889 INFO [train.py:996] (3/4) Epoch 8, batch 21150, loss[loss=0.1995, simple_loss=0.2584, pruned_loss=0.07031, over 21681.00 frames. ], tot_loss[loss=0.205, simple_loss=0.2765, pruned_loss=0.06674, over 4223447.91 frames. ], batch size: 333, lr: 3.68e-03, grad_scale: 16.0 2023-06-25 19:58:59,525 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=1407672.0, ans=0.0 2023-06-25 19:59:04,131 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1407672.0, ans=0.125 2023-06-25 19:59:05,667 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=1407732.0, ans=0.125 2023-06-25 19:59:06,324 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=7.04 vs. limit=10.0 2023-06-25 19:59:30,834 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.59 vs. limit=15.0 2023-06-25 19:59:35,026 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=1407792.0, ans=0.0 2023-06-25 19:59:37,029 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1407792.0, ans=0.125 2023-06-25 19:59:47,415 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=1407792.0, ans=0.2 2023-06-25 20:00:16,730 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1407912.0, ans=0.1 2023-06-25 20:00:38,491 INFO [train.py:996] (3/4) Epoch 8, batch 21200, loss[loss=0.1572, simple_loss=0.2298, pruned_loss=0.04232, over 20779.00 frames. ], tot_loss[loss=0.2033, simple_loss=0.2735, pruned_loss=0.06661, over 4238743.63 frames. 
], batch size: 608, lr: 3.68e-03, grad_scale: 32.0 2023-06-25 20:00:42,532 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1407972.0, ans=0.125 2023-06-25 20:00:45,618 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1407972.0, ans=0.1 2023-06-25 20:01:34,475 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.940e+02 3.823e+02 4.703e+02 6.840e+02 1.518e+03, threshold=9.406e+02, percent-clipped=1.0 2023-06-25 20:02:13,238 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.96 vs. limit=22.5 2023-06-25 20:02:16,541 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.71 vs. limit=10.0 2023-06-25 20:02:26,006 INFO [train.py:996] (3/4) Epoch 8, batch 21250, loss[loss=0.2276, simple_loss=0.2978, pruned_loss=0.07872, over 21632.00 frames. ], tot_loss[loss=0.202, simple_loss=0.2713, pruned_loss=0.06635, over 4240458.48 frames. ], batch size: 391, lr: 3.68e-03, grad_scale: 16.0 2023-06-25 20:02:31,665 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=1408272.0, ans=0.2 2023-06-25 20:02:36,681 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1408272.0, ans=0.1 2023-06-25 20:02:43,834 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.93 vs. limit=15.0 2023-06-25 20:03:35,978 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=1408452.0, ans=0.2 2023-06-25 20:04:11,918 INFO [train.py:996] (3/4) Epoch 8, batch 21300, loss[loss=0.2184, simple_loss=0.2884, pruned_loss=0.07416, over 21372.00 frames. ], tot_loss[loss=0.2062, simple_loss=0.2772, pruned_loss=0.06762, over 4244574.27 frames. ], batch size: 176, lr: 3.68e-03, grad_scale: 16.0 2023-06-25 20:04:17,189 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.30 vs. limit=6.0 2023-06-25 20:04:23,806 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=1408572.0, ans=0.2 2023-06-25 20:04:41,348 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=1408632.0, ans=0.125 2023-06-25 20:05:07,867 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.197e+02 4.370e+02 6.934e+02 9.057e+02 1.727e+03, threshold=1.387e+03, percent-clipped=23.0 2023-06-25 20:05:46,299 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=10.97 vs. limit=15.0 2023-06-25 20:05:51,146 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1408812.0, ans=0.125 2023-06-25 20:05:58,747 INFO [train.py:996] (3/4) Epoch 8, batch 21350, loss[loss=0.2065, simple_loss=0.3024, pruned_loss=0.05531, over 21645.00 frames. ], tot_loss[loss=0.2089, simple_loss=0.2818, pruned_loss=0.06805, over 4259507.91 frames. 
], batch size: 389, lr: 3.68e-03, grad_scale: 16.0 2023-06-25 20:06:46,557 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=11.44 vs. limit=15.0 2023-06-25 20:07:29,107 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1409112.0, ans=0.1 2023-06-25 20:07:45,953 INFO [train.py:996] (3/4) Epoch 8, batch 21400, loss[loss=0.22, simple_loss=0.3047, pruned_loss=0.06764, over 21830.00 frames. ], tot_loss[loss=0.2114, simple_loss=0.2862, pruned_loss=0.06835, over 4268365.36 frames. ], batch size: 371, lr: 3.68e-03, grad_scale: 16.0 2023-06-25 20:08:36,713 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-25 20:08:46,533 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.047e+02 3.806e+02 5.030e+02 6.995e+02 1.894e+03, threshold=1.006e+03, percent-clipped=5.0 2023-06-25 20:09:32,611 INFO [train.py:996] (3/4) Epoch 8, batch 21450, loss[loss=0.227, simple_loss=0.294, pruned_loss=0.08001, over 21599.00 frames. ], tot_loss[loss=0.2146, simple_loss=0.2888, pruned_loss=0.07018, over 4266403.97 frames. ], batch size: 548, lr: 3.68e-03, grad_scale: 16.0 2023-06-25 20:09:52,262 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1409472.0, ans=0.125 2023-06-25 20:09:59,898 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=1409532.0, ans=0.0 2023-06-25 20:10:19,392 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1409532.0, ans=0.0 2023-06-25 20:10:22,881 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=1409592.0, ans=0.04949747468305833 2023-06-25 20:10:26,048 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1409592.0, ans=0.125 2023-06-25 20:10:48,331 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=1409652.0, ans=0.2 2023-06-25 20:11:20,624 INFO [train.py:996] (3/4) Epoch 8, batch 21500, loss[loss=0.1953, simple_loss=0.2625, pruned_loss=0.06405, over 21678.00 frames. ], tot_loss[loss=0.2151, simple_loss=0.2881, pruned_loss=0.07103, over 4260103.03 frames. ], batch size: 333, lr: 3.68e-03, grad_scale: 16.0 2023-06-25 20:11:21,633 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=6.48 vs. 
limit=22.5 2023-06-25 20:11:50,397 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=1409832.0, ans=0.2 2023-06-25 20:12:04,155 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=1409832.0, ans=0.2 2023-06-25 20:12:12,754 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-25 20:12:14,541 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=1409892.0, ans=0.2 2023-06-25 20:12:25,074 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.002e+02 3.682e+02 4.429e+02 6.594e+02 1.934e+03, threshold=8.857e+02, percent-clipped=12.0 2023-06-25 20:13:05,243 INFO [train.py:996] (3/4) Epoch 8, batch 21550, loss[loss=0.1471, simple_loss=0.2253, pruned_loss=0.03451, over 21649.00 frames. ], tot_loss[loss=0.2089, simple_loss=0.2813, pruned_loss=0.06824, over 4271410.04 frames. ], batch size: 298, lr: 3.68e-03, grad_scale: 8.0 2023-06-25 20:14:53,572 INFO [train.py:996] (3/4) Epoch 8, batch 21600, loss[loss=0.1875, simple_loss=0.2678, pruned_loss=0.05358, over 21377.00 frames. ], tot_loss[loss=0.2066, simple_loss=0.2778, pruned_loss=0.06769, over 4261985.87 frames. ], batch size: 211, lr: 3.68e-03, grad_scale: 16.0 2023-06-25 20:15:21,688 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=12.65 vs. limit=15.0 2023-06-25 20:16:01,323 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=1410492.0, ans=0.07 2023-06-25 20:16:02,282 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.883e+02 3.709e+02 4.996e+02 7.825e+02 2.196e+03, threshold=9.991e+02, percent-clipped=18.0 2023-06-25 20:16:23,384 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=1410612.0, ans=0.125 2023-06-25 20:16:46,682 INFO [train.py:996] (3/4) Epoch 8, batch 21650, loss[loss=0.2051, simple_loss=0.285, pruned_loss=0.06262, over 21198.00 frames. ], tot_loss[loss=0.2062, simple_loss=0.281, pruned_loss=0.06568, over 4260263.70 frames. ], batch size: 159, lr: 3.68e-03, grad_scale: 16.0 2023-06-25 20:16:57,135 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1410672.0, ans=0.125 2023-06-25 20:17:08,421 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=1410732.0, ans=0.125 2023-06-25 20:17:49,155 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 20:17:52,512 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1410852.0, ans=0.0 2023-06-25 20:18:25,951 INFO [train.py:996] (3/4) Epoch 8, batch 21700, loss[loss=0.2176, simple_loss=0.2835, pruned_loss=0.0759, over 21551.00 frames. ], tot_loss[loss=0.2042, simple_loss=0.2801, pruned_loss=0.06419, over 4248577.21 frames. 
], batch size: 414, lr: 3.68e-03, grad_scale: 16.0 2023-06-25 20:19:33,109 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.665e+02 3.626e+02 5.313e+02 7.928e+02 1.804e+03, threshold=1.063e+03, percent-clipped=12.0 2023-06-25 20:19:37,237 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=1411152.0, ans=0.2 2023-06-25 20:19:47,106 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=1411152.0, ans=0.125 2023-06-25 20:20:12,837 INFO [train.py:996] (3/4) Epoch 8, batch 21750, loss[loss=0.1945, simple_loss=0.2517, pruned_loss=0.0687, over 21477.00 frames. ], tot_loss[loss=0.2025, simple_loss=0.2778, pruned_loss=0.06357, over 4251462.99 frames. ], batch size: 195, lr: 3.68e-03, grad_scale: 16.0 2023-06-25 20:20:22,751 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1411272.0, ans=0.125 2023-06-25 20:20:54,070 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1411332.0, ans=0.0 2023-06-25 20:21:04,574 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1411392.0, ans=0.125 2023-06-25 20:21:05,098 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=8.60 vs. limit=15.0 2023-06-25 20:21:17,958 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1411392.0, ans=0.125 2023-06-25 20:22:07,465 INFO [train.py:996] (3/4) Epoch 8, batch 21800, loss[loss=0.2423, simple_loss=0.3295, pruned_loss=0.07757, over 21766.00 frames. ], tot_loss[loss=0.2028, simple_loss=0.2761, pruned_loss=0.06475, over 4253056.97 frames. ], batch size: 333, lr: 3.68e-03, grad_scale: 16.0 2023-06-25 20:22:47,001 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1411632.0, ans=0.1 2023-06-25 20:22:55,902 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.04 vs. limit=12.0 2023-06-25 20:23:10,285 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.928e+02 3.851e+02 5.673e+02 8.450e+02 2.187e+03, threshold=1.135e+03, percent-clipped=14.0 2023-06-25 20:23:12,547 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1411692.0, ans=0.1 2023-06-25 20:23:33,937 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=10.42 vs. limit=15.0 2023-06-25 20:23:53,746 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1411872.0, ans=0.1 2023-06-25 20:23:54,654 INFO [train.py:996] (3/4) Epoch 8, batch 21850, loss[loss=0.2402, simple_loss=0.3078, pruned_loss=0.08631, over 21791.00 frames. ], tot_loss[loss=0.2075, simple_loss=0.2835, pruned_loss=0.06575, over 4246596.43 frames. ], batch size: 441, lr: 3.68e-03, grad_scale: 16.0 2023-06-25 20:25:11,266 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=12.95 vs. 
limit=15.0 2023-06-25 20:25:15,711 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 20:25:44,254 INFO [train.py:996] (3/4) Epoch 8, batch 21900, loss[loss=0.182, simple_loss=0.2507, pruned_loss=0.05664, over 21343.00 frames. ], tot_loss[loss=0.209, simple_loss=0.2846, pruned_loss=0.06677, over 4262933.97 frames. ], batch size: 131, lr: 3.68e-03, grad_scale: 16.0 2023-06-25 20:25:46,712 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1412172.0, ans=0.125 2023-06-25 20:26:03,483 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1412172.0, ans=0.1 2023-06-25 20:26:17,352 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1412232.0, ans=0.1 2023-06-25 20:26:35,799 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=16.10 vs. limit=22.5 2023-06-25 20:26:36,666 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1412292.0, ans=0.125 2023-06-25 20:26:45,261 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1412292.0, ans=0.0 2023-06-25 20:26:46,374 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.095e+02 4.152e+02 5.797e+02 7.520e+02 1.468e+03, threshold=1.159e+03, percent-clipped=2.0 2023-06-25 20:27:36,363 INFO [train.py:996] (3/4) Epoch 8, batch 21950, loss[loss=0.169, simple_loss=0.2373, pruned_loss=0.0504, over 21777.00 frames. ], tot_loss[loss=0.2052, simple_loss=0.2792, pruned_loss=0.06559, over 4254012.85 frames. ], batch size: 107, lr: 3.68e-03, grad_scale: 16.0 2023-06-25 20:28:12,800 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=1412532.0, ans=0.125 2023-06-25 20:28:37,592 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1412652.0, ans=0.125 2023-06-25 20:28:51,940 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1412652.0, ans=0.125 2023-06-25 20:29:25,469 INFO [train.py:996] (3/4) Epoch 8, batch 22000, loss[loss=0.2156, simple_loss=0.3046, pruned_loss=0.06328, over 21228.00 frames. ], tot_loss[loss=0.2015, simple_loss=0.2751, pruned_loss=0.06399, over 4252198.55 frames. ], batch size: 549, lr: 3.68e-03, grad_scale: 32.0 2023-06-25 20:29:43,438 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1412772.0, ans=0.0 2023-06-25 20:29:43,544 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=1412772.0, ans=0.0 2023-06-25 20:30:23,048 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.352e+02 3.905e+02 5.232e+02 7.810e+02 2.335e+03, threshold=1.046e+03, percent-clipped=14.0 2023-06-25 20:30:30,550 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1412952.0, ans=0.1 2023-06-25 20:31:13,842 INFO [train.py:996] (3/4) Epoch 8, batch 22050, loss[loss=0.2597, simple_loss=0.3426, pruned_loss=0.0884, over 21911.00 frames. 
], tot_loss[loss=0.2048, simple_loss=0.2797, pruned_loss=0.065, over 4250909.18 frames. ], batch size: 372, lr: 3.67e-03, grad_scale: 32.0 2023-06-25 20:31:14,654 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1413072.0, ans=0.1 2023-06-25 20:31:45,860 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=1413132.0, ans=0.5 2023-06-25 20:32:08,688 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=1413192.0, ans=0.0 2023-06-25 20:32:35,358 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=1413252.0, ans=0.2 2023-06-25 20:32:35,406 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=1413252.0, ans=0.05 2023-06-25 20:32:51,398 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1413312.0, ans=0.125 2023-06-25 20:33:02,235 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten.whitening_limit, batch_count=1413372.0, ans=15.0 2023-06-25 20:33:02,599 INFO [train.py:996] (3/4) Epoch 8, batch 22100, loss[loss=0.235, simple_loss=0.3165, pruned_loss=0.07678, over 21711.00 frames. ], tot_loss[loss=0.2131, simple_loss=0.2888, pruned_loss=0.06874, over 4251963.40 frames. ], batch size: 298, lr: 3.67e-03, grad_scale: 16.0 2023-06-25 20:33:55,809 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1413492.0, ans=0.1 2023-06-25 20:34:00,123 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.256e+02 4.552e+02 6.727e+02 1.040e+03 2.213e+03, threshold=1.345e+03, percent-clipped=23.0 2023-06-25 20:34:12,288 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1413552.0, ans=0.0 2023-06-25 20:34:47,946 INFO [train.py:996] (3/4) Epoch 8, batch 22150, loss[loss=0.2343, simple_loss=0.2998, pruned_loss=0.08447, over 21200.00 frames. ], tot_loss[loss=0.2152, simple_loss=0.2911, pruned_loss=0.06964, over 4261114.63 frames. ], batch size: 143, lr: 3.67e-03, grad_scale: 16.0 2023-06-25 20:34:55,589 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=1413672.0, ans=0.0 2023-06-25 20:36:26,133 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=1413912.0, ans=0.2 2023-06-25 20:36:35,741 INFO [train.py:996] (3/4) Epoch 8, batch 22200, loss[loss=0.2776, simple_loss=0.3637, pruned_loss=0.09573, over 21568.00 frames. ], tot_loss[loss=0.2176, simple_loss=0.2929, pruned_loss=0.07116, over 4270590.43 frames. ], batch size: 471, lr: 3.67e-03, grad_scale: 16.0 2023-06-25 20:36:57,926 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1414032.0, ans=0.0 2023-06-25 20:37:25,517 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1414092.0, ans=0.1 2023-06-25 20:37:29,360 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=5.63 vs. 
limit=22.5 2023-06-25 20:37:29,746 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.047e+02 4.294e+02 5.583e+02 8.306e+02 1.665e+03, threshold=1.117e+03, percent-clipped=3.0 2023-06-25 20:37:31,985 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1414152.0, ans=0.125 2023-06-25 20:38:23,349 INFO [train.py:996] (3/4) Epoch 8, batch 22250, loss[loss=0.2463, simple_loss=0.3239, pruned_loss=0.08435, over 21339.00 frames. ], tot_loss[loss=0.2221, simple_loss=0.2988, pruned_loss=0.07273, over 4275630.31 frames. ], batch size: 548, lr: 3.67e-03, grad_scale: 16.0 2023-06-25 20:38:59,674 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=1414332.0, ans=0.125 2023-06-25 20:39:13,608 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=1414392.0, ans=0.0 2023-06-25 20:39:55,882 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=1414512.0, ans=0.125 2023-06-25 20:40:04,008 INFO [train.py:996] (3/4) Epoch 8, batch 22300, loss[loss=0.2105, simple_loss=0.2789, pruned_loss=0.07109, over 21688.00 frames. ], tot_loss[loss=0.2247, simple_loss=0.3005, pruned_loss=0.07443, over 4271934.09 frames. ], batch size: 263, lr: 3.67e-03, grad_scale: 16.0 2023-06-25 20:40:37,414 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.63 vs. limit=22.5 2023-06-25 20:40:49,021 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=1414692.0, ans=0.125 2023-06-25 20:40:57,263 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.146e+02 4.093e+02 5.360e+02 7.335e+02 1.399e+03, threshold=1.072e+03, percent-clipped=5.0 2023-06-25 20:41:28,016 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1414812.0, ans=0.125 2023-06-25 20:41:31,516 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=1414812.0, ans=0.125 2023-06-25 20:41:38,265 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1414812.0, ans=0.125 2023-06-25 20:41:51,931 INFO [train.py:996] (3/4) Epoch 8, batch 22350, loss[loss=0.204, simple_loss=0.2701, pruned_loss=0.06891, over 21033.00 frames. ], tot_loss[loss=0.2244, simple_loss=0.2992, pruned_loss=0.07483, over 4279387.29 frames. ], batch size: 607, lr: 3.67e-03, grad_scale: 16.0 2023-06-25 20:43:04,255 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=1415052.0, ans=0.07 2023-06-25 20:43:15,867 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1415112.0, ans=0.125 2023-06-25 20:43:38,700 INFO [train.py:996] (3/4) Epoch 8, batch 22400, loss[loss=0.1961, simple_loss=0.2729, pruned_loss=0.05964, over 21430.00 frames. ], tot_loss[loss=0.2207, simple_loss=0.2957, pruned_loss=0.07283, over 4283036.81 frames. ], batch size: 212, lr: 3.67e-03, grad_scale: 32.0 2023-06-25 20:43:57,330 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.87 vs. 
limit=15.0 2023-06-25 20:44:21,540 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=11.16 vs. limit=15.0 2023-06-25 20:44:34,051 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.012e+02 4.038e+02 6.138e+02 7.809e+02 1.292e+03, threshold=1.228e+03, percent-clipped=3.0 2023-06-25 20:44:56,928 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1415412.0, ans=0.0 2023-06-25 20:45:08,999 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=1415412.0, ans=0.125 2023-06-25 20:45:14,368 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=1415412.0, ans=0.05 2023-06-25 20:45:23,611 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.17 vs. limit=10.0 2023-06-25 20:45:25,813 INFO [train.py:996] (3/4) Epoch 8, batch 22450, loss[loss=0.2083, simple_loss=0.268, pruned_loss=0.07429, over 16231.00 frames. ], tot_loss[loss=0.2159, simple_loss=0.289, pruned_loss=0.07138, over 4281277.00 frames. ], batch size: 66, lr: 3.67e-03, grad_scale: 16.0 2023-06-25 20:45:48,862 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=1415532.0, ans=10.0 2023-06-25 20:46:37,695 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=1415652.0, ans=0.125 2023-06-25 20:46:53,010 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.54 vs. limit=15.0 2023-06-25 20:46:57,584 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=1415712.0, ans=0.0 2023-06-25 20:47:12,139 INFO [train.py:996] (3/4) Epoch 8, batch 22500, loss[loss=0.2222, simple_loss=0.317, pruned_loss=0.06373, over 21652.00 frames. ], tot_loss[loss=0.2131, simple_loss=0.2849, pruned_loss=0.07068, over 4274718.73 frames. ], batch size: 247, lr: 3.67e-03, grad_scale: 8.0 2023-06-25 20:47:16,079 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1415772.0, ans=0.125 2023-06-25 20:47:52,619 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=1415892.0, ans=0.09899494936611666 2023-06-25 20:47:52,627 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1415892.0, ans=0.125 2023-06-25 20:48:12,256 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.71 vs. limit=15.0 2023-06-25 20:48:14,378 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.749e+02 3.949e+02 4.919e+02 7.887e+02 2.030e+03, threshold=9.838e+02, percent-clipped=13.0 2023-06-25 20:48:55,526 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.86 vs. limit=10.0 2023-06-25 20:49:01,280 INFO [train.py:996] (3/4) Epoch 8, batch 22550, loss[loss=0.2148, simple_loss=0.296, pruned_loss=0.06685, over 21795.00 frames. 
], tot_loss[loss=0.2146, simple_loss=0.2884, pruned_loss=0.07044, over 4282442.02 frames. ], batch size: 298, lr: 3.67e-03, grad_scale: 8.0 2023-06-25 20:49:07,832 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1416072.0, ans=0.0 2023-06-25 20:50:18,009 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1416252.0, ans=0.1 2023-06-25 20:50:35,089 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1416312.0, ans=0.125 2023-06-25 20:50:43,949 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1416312.0, ans=0.1 2023-06-25 20:50:52,279 INFO [train.py:996] (3/4) Epoch 8, batch 22600, loss[loss=0.2377, simple_loss=0.338, pruned_loss=0.06872, over 21231.00 frames. ], tot_loss[loss=0.2165, simple_loss=0.2914, pruned_loss=0.07076, over 4279195.36 frames. ], batch size: 548, lr: 3.67e-03, grad_scale: 8.0 2023-06-25 20:51:35,390 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-25 20:52:04,660 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.075e+02 4.518e+02 6.028e+02 9.364e+02 1.882e+03, threshold=1.206e+03, percent-clipped=21.0 2023-06-25 20:52:12,681 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=5.17 vs. limit=15.0 2023-06-25 20:52:15,297 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=1416552.0, ans=0.2 2023-06-25 20:52:38,679 INFO [train.py:996] (3/4) Epoch 8, batch 22650, loss[loss=0.1822, simple_loss=0.2442, pruned_loss=0.06016, over 21464.00 frames. ], tot_loss[loss=0.2153, simple_loss=0.2888, pruned_loss=0.07084, over 4258086.42 frames. ], batch size: 195, lr: 3.67e-03, grad_scale: 8.0 2023-06-25 20:53:29,214 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-25 20:53:29,265 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1416792.0, ans=0.125 2023-06-25 20:53:34,333 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1416792.0, ans=0.125 2023-06-25 20:54:20,824 INFO [train.py:996] (3/4) Epoch 8, batch 22700, loss[loss=0.1826, simple_loss=0.2478, pruned_loss=0.0587, over 21219.00 frames. ], tot_loss[loss=0.2134, simple_loss=0.284, pruned_loss=0.07139, over 4268881.69 frames. 
], batch size: 176, lr: 3.67e-03, grad_scale: 8.0 2023-06-25 20:55:33,791 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.950e+02 3.999e+02 5.550e+02 8.694e+02 1.659e+03, threshold=1.110e+03, percent-clipped=6.0 2023-06-25 20:55:37,906 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1417152.0, ans=0.1 2023-06-25 20:55:41,649 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1417152.0, ans=0.1 2023-06-25 20:56:05,068 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten.whitening_limit, batch_count=1417212.0, ans=15.0 2023-06-25 20:56:08,907 INFO [train.py:996] (3/4) Epoch 8, batch 22750, loss[loss=0.1868, simple_loss=0.2314, pruned_loss=0.0711, over 20689.00 frames. ], tot_loss[loss=0.2152, simple_loss=0.2847, pruned_loss=0.07283, over 4254522.47 frames. ], batch size: 607, lr: 3.67e-03, grad_scale: 8.0 2023-06-25 20:56:19,973 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=1417272.0, ans=0.5 2023-06-25 20:56:54,955 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1417392.0, ans=0.125 2023-06-25 20:57:55,434 INFO [train.py:996] (3/4) Epoch 8, batch 22800, loss[loss=0.2307, simple_loss=0.3049, pruned_loss=0.07824, over 21877.00 frames. ], tot_loss[loss=0.2175, simple_loss=0.2872, pruned_loss=0.07392, over 4263710.18 frames. ], batch size: 107, lr: 3.67e-03, grad_scale: 16.0 2023-06-25 20:58:14,880 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.10 vs. limit=6.0 2023-06-25 20:58:34,037 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1417632.0, ans=0.1 2023-06-25 20:58:52,605 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1417692.0, ans=0.125 2023-06-25 20:59:04,199 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1417692.0, ans=0.125 2023-06-25 20:59:05,984 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 20:59:06,879 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.216e+02 4.609e+02 5.638e+02 8.633e+02 1.980e+03, threshold=1.128e+03, percent-clipped=10.0 2023-06-25 20:59:26,419 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1417812.0, ans=0.125 2023-06-25 20:59:38,237 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1417812.0, ans=0.1 2023-06-25 20:59:41,021 INFO [train.py:996] (3/4) Epoch 8, batch 22850, loss[loss=0.2017, simple_loss=0.2654, pruned_loss=0.06897, over 21848.00 frames. ], tot_loss[loss=0.2151, simple_loss=0.2841, pruned_loss=0.07303, over 4260490.25 frames. 
], batch size: 118, lr: 3.67e-03, grad_scale: 16.0 2023-06-25 20:59:43,309 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=1417872.0, ans=0.125 2023-06-25 21:00:29,741 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=10.80 vs. limit=15.0 2023-06-25 21:00:40,335 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.67 vs. limit=15.0 2023-06-25 21:00:57,640 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.54 vs. limit=15.0 2023-06-25 21:01:15,210 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1418112.0, ans=0.2 2023-06-25 21:01:18,725 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=1418112.0, ans=0.2 2023-06-25 21:01:30,608 INFO [train.py:996] (3/4) Epoch 8, batch 22900, loss[loss=0.1462, simple_loss=0.2016, pruned_loss=0.04539, over 16154.00 frames. ], tot_loss[loss=0.215, simple_loss=0.286, pruned_loss=0.07205, over 4254816.16 frames. ], batch size: 60, lr: 3.67e-03, grad_scale: 16.0 2023-06-25 21:01:40,364 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=1418172.0, ans=0.05 2023-06-25 21:02:45,749 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.594e+02 4.620e+02 6.877e+02 1.071e+03 2.318e+03, threshold=1.375e+03, percent-clipped=23.0 2023-06-25 21:03:02,189 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1418412.0, ans=0.1 2023-06-25 21:03:07,236 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=1418412.0, ans=0.0 2023-06-25 21:03:25,358 INFO [train.py:996] (3/4) Epoch 8, batch 22950, loss[loss=0.204, simple_loss=0.3098, pruned_loss=0.04914, over 21450.00 frames. ], tot_loss[loss=0.2202, simple_loss=0.2983, pruned_loss=0.0711, over 4260247.45 frames. ], batch size: 211, lr: 3.67e-03, grad_scale: 16.0 2023-06-25 21:04:08,333 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=1418532.0, ans=0.125 2023-06-25 21:04:12,028 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=1418532.0, ans=0.125 2023-06-25 21:04:33,986 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1418652.0, ans=0.0 2023-06-25 21:05:12,930 INFO [train.py:996] (3/4) Epoch 8, batch 23000, loss[loss=0.2292, simple_loss=0.2924, pruned_loss=0.08298, over 21538.00 frames. ], tot_loss[loss=0.2178, simple_loss=0.2981, pruned_loss=0.06876, over 4261395.96 frames. 
], batch size: 548, lr: 3.67e-03, grad_scale: 16.0 2023-06-25 21:05:13,644 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=1418772.0, ans=0.2 2023-06-25 21:06:10,514 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.651e+02 4.060e+02 5.403e+02 8.584e+02 1.736e+03, threshold=1.081e+03, percent-clipped=10.0 2023-06-25 21:06:24,014 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.46 vs. limit=22.5 2023-06-25 21:06:44,800 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=6.06 vs. limit=15.0 2023-06-25 21:06:47,618 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=1419012.0, ans=0.125 2023-06-25 21:06:53,054 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1419012.0, ans=0.125 2023-06-25 21:06:55,853 INFO [train.py:996] (3/4) Epoch 8, batch 23050, loss[loss=0.2299, simple_loss=0.3061, pruned_loss=0.07682, over 21450.00 frames. ], tot_loss[loss=0.2213, simple_loss=0.3002, pruned_loss=0.0712, over 4270851.45 frames. ], batch size: 211, lr: 3.67e-03, grad_scale: 16.0 2023-06-25 21:07:26,243 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=1419132.0, ans=0.09899494936611666 2023-06-25 21:07:41,312 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=1419192.0, ans=0.0 2023-06-25 21:08:00,110 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1419252.0, ans=0.1 2023-06-25 21:08:29,403 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1419312.0, ans=0.1 2023-06-25 21:08:40,026 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1419312.0, ans=0.125 2023-06-25 21:08:42,762 INFO [train.py:996] (3/4) Epoch 8, batch 23100, loss[loss=0.1705, simple_loss=0.2366, pruned_loss=0.05216, over 21598.00 frames. ], tot_loss[loss=0.2186, simple_loss=0.2951, pruned_loss=0.0711, over 4272164.74 frames. ], batch size: 247, lr: 3.67e-03, grad_scale: 16.0 2023-06-25 21:09:44,362 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.818e+02 4.180e+02 5.701e+02 8.990e+02 1.720e+03, threshold=1.140e+03, percent-clipped=10.0 2023-06-25 21:09:48,404 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=1419552.0, ans=0.2 2023-06-25 21:10:20,726 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=12.23 vs. limit=15.0 2023-06-25 21:10:22,761 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.82 vs. limit=15.0 2023-06-25 21:10:30,262 INFO [train.py:996] (3/4) Epoch 8, batch 23150, loss[loss=0.1934, simple_loss=0.2655, pruned_loss=0.06069, over 21830.00 frames. ], tot_loss[loss=0.215, simple_loss=0.2896, pruned_loss=0.0702, over 4280119.90 frames. 
], batch size: 247, lr: 3.67e-03, grad_scale: 16.0 2023-06-25 21:11:16,013 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1419792.0, ans=0.0 2023-06-25 21:11:36,818 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1419852.0, ans=0.1 2023-06-25 21:12:17,933 INFO [train.py:996] (3/4) Epoch 8, batch 23200, loss[loss=0.2196, simple_loss=0.2917, pruned_loss=0.07377, over 21383.00 frames. ], tot_loss[loss=0.2154, simple_loss=0.2886, pruned_loss=0.07113, over 4286540.97 frames. ], batch size: 159, lr: 3.67e-03, grad_scale: 32.0 2023-06-25 21:13:16,618 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=1420092.0, ans=0.125 2023-06-25 21:13:19,450 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.122e+02 4.151e+02 5.652e+02 8.200e+02 1.593e+03, threshold=1.130e+03, percent-clipped=6.0 2023-06-25 21:13:30,316 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=1420152.0, ans=0.04949747468305833 2023-06-25 21:13:59,465 INFO [train.py:996] (3/4) Epoch 8, batch 23250, loss[loss=0.2113, simple_loss=0.2853, pruned_loss=0.06863, over 21901.00 frames. ], tot_loss[loss=0.2176, simple_loss=0.2902, pruned_loss=0.07251, over 4286055.43 frames. ], batch size: 316, lr: 3.67e-03, grad_scale: 16.0 2023-06-25 21:14:01,845 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1420272.0, ans=0.125 2023-06-25 21:14:10,209 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1420272.0, ans=0.125 2023-06-25 21:14:35,915 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=1420332.0, ans=0.125 2023-06-25 21:14:36,441 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.15 vs. limit=15.0 2023-06-25 21:14:37,794 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1420332.0, ans=0.0 2023-06-25 21:15:20,690 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=5.86 vs. limit=22.5 2023-06-25 21:15:52,850 INFO [train.py:996] (3/4) Epoch 8, batch 23300, loss[loss=0.2132, simple_loss=0.3122, pruned_loss=0.0571, over 21811.00 frames. ], tot_loss[loss=0.2237, simple_loss=0.2975, pruned_loss=0.07499, over 4282383.69 frames. ], batch size: 282, lr: 3.67e-03, grad_scale: 16.0 2023-06-25 21:16:02,291 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.min_positive, batch_count=1420572.0, ans=0.05 2023-06-25 21:16:57,819 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.217e+02 4.429e+02 5.607e+02 7.442e+02 1.718e+03, threshold=1.121e+03, percent-clipped=5.0 2023-06-25 21:17:13,234 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.81 vs. 
limit=15.0 2023-06-25 21:17:38,550 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1420812.0, ans=0.125 2023-06-25 21:17:41,345 INFO [train.py:996] (3/4) Epoch 8, batch 23350, loss[loss=0.2368, simple_loss=0.3219, pruned_loss=0.0759, over 20712.00 frames. ], tot_loss[loss=0.2237, simple_loss=0.301, pruned_loss=0.07325, over 4274402.23 frames. ], batch size: 607, lr: 3.66e-03, grad_scale: 16.0 2023-06-25 21:17:41,962 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=1420872.0, ans=0.07 2023-06-25 21:18:01,406 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=1420932.0, ans=0.0 2023-06-25 21:18:01,443 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=1420932.0, ans=0.125 2023-06-25 21:18:46,240 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=5.95 vs. limit=10.0 2023-06-25 21:19:29,485 INFO [train.py:996] (3/4) Epoch 8, batch 23400, loss[loss=0.1788, simple_loss=0.2598, pruned_loss=0.04885, over 21093.00 frames. ], tot_loss[loss=0.2168, simple_loss=0.2941, pruned_loss=0.06978, over 4279501.41 frames. ], batch size: 608, lr: 3.66e-03, grad_scale: 16.0 2023-06-25 21:20:34,180 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.708e+02 4.466e+02 6.262e+02 8.598e+02 1.529e+03, threshold=1.252e+03, percent-clipped=12.0 2023-06-25 21:20:50,600 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1421352.0, ans=0.0 2023-06-25 21:20:57,268 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=1421352.0, ans=0.0 2023-06-25 21:20:57,276 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1421352.0, ans=0.125 2023-06-25 21:21:13,197 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.55 vs. limit=6.0 2023-06-25 21:21:17,392 INFO [train.py:996] (3/4) Epoch 8, batch 23450, loss[loss=0.2554, simple_loss=0.3395, pruned_loss=0.08559, over 21858.00 frames. ], tot_loss[loss=0.2181, simple_loss=0.2938, pruned_loss=0.07115, over 4284996.98 frames. ], batch size: 124, lr: 3.66e-03, grad_scale: 16.0 2023-06-25 21:23:04,829 INFO [train.py:996] (3/4) Epoch 8, batch 23500, loss[loss=0.2093, simple_loss=0.2855, pruned_loss=0.06661, over 21884.00 frames. ], tot_loss[loss=0.2192, simple_loss=0.2945, pruned_loss=0.07193, over 4280652.19 frames. ], batch size: 124, lr: 3.66e-03, grad_scale: 16.0 2023-06-25 21:24:07,688 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.437e+02 4.437e+02 5.920e+02 8.678e+02 1.556e+03, threshold=1.184e+03, percent-clipped=4.0 2023-06-25 21:24:34,394 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1422012.0, ans=0.0 2023-06-25 21:24:50,815 INFO [train.py:996] (3/4) Epoch 8, batch 23550, loss[loss=0.2079, simple_loss=0.2665, pruned_loss=0.07468, over 21579.00 frames. ], tot_loss[loss=0.2172, simple_loss=0.2902, pruned_loss=0.07206, over 4290381.27 frames. 
], batch size: 213, lr: 3.66e-03, grad_scale: 16.0 2023-06-25 21:25:22,835 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.71 vs. limit=22.5 2023-06-25 21:26:34,219 INFO [train.py:996] (3/4) Epoch 8, batch 23600, loss[loss=0.2508, simple_loss=0.3321, pruned_loss=0.08478, over 21453.00 frames. ], tot_loss[loss=0.2174, simple_loss=0.2899, pruned_loss=0.07244, over 4284530.11 frames. ], batch size: 131, lr: 3.66e-03, grad_scale: 32.0 2023-06-25 21:26:45,515 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=1422372.0, ans=0.5 2023-06-25 21:26:47,041 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1422372.0, ans=0.0 2023-06-25 21:26:51,412 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.15 vs. limit=22.5 2023-06-25 21:27:12,810 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.86 vs. limit=15.0 2023-06-25 21:27:24,654 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=1422492.0, ans=0.2 2023-06-25 21:27:42,812 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1422552.0, ans=0.125 2023-06-25 21:27:45,446 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.060e+02 4.385e+02 5.770e+02 8.074e+02 1.431e+03, threshold=1.154e+03, percent-clipped=6.0 2023-06-25 21:28:17,973 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1422672.0, ans=0.125 2023-06-25 21:28:19,010 INFO [train.py:996] (3/4) Epoch 8, batch 23650, loss[loss=0.245, simple_loss=0.3294, pruned_loss=0.08029, over 21563.00 frames. ], tot_loss[loss=0.2155, simple_loss=0.2897, pruned_loss=0.07064, over 4276835.52 frames. ], batch size: 414, lr: 3.66e-03, grad_scale: 32.0 2023-06-25 21:29:17,898 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=1422792.0, ans=0.5 2023-06-25 21:29:52,681 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1422912.0, ans=0.125 2023-06-25 21:29:54,412 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1422912.0, ans=0.125 2023-06-25 21:30:15,663 INFO [train.py:996] (3/4) Epoch 8, batch 23700, loss[loss=0.1922, simple_loss=0.2724, pruned_loss=0.05603, over 21745.00 frames. ], tot_loss[loss=0.2158, simple_loss=0.2916, pruned_loss=0.07, over 4274992.72 frames. 
], batch size: 282, lr: 3.66e-03, grad_scale: 32.0 2023-06-25 21:30:28,832 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=1422972.0, ans=0.125 2023-06-25 21:30:49,873 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=1423032.0, ans=0.0 2023-06-25 21:30:57,129 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1423032.0, ans=0.125 2023-06-25 21:30:58,981 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1423032.0, ans=0.1 2023-06-25 21:31:17,314 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-25 21:31:21,925 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.069e+02 4.706e+02 7.567e+02 1.059e+03 2.312e+03, threshold=1.513e+03, percent-clipped=21.0 2023-06-25 21:31:23,084 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.79 vs. limit=15.0 2023-06-25 21:32:05,909 INFO [train.py:996] (3/4) Epoch 8, batch 23750, loss[loss=0.2065, simple_loss=0.2847, pruned_loss=0.0642, over 21040.00 frames. ], tot_loss[loss=0.2175, simple_loss=0.2938, pruned_loss=0.07062, over 4272992.59 frames. ], batch size: 143, lr: 3.66e-03, grad_scale: 32.0 2023-06-25 21:32:40,352 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1423332.0, ans=0.125 2023-06-25 21:32:58,918 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1423392.0, ans=0.125 2023-06-25 21:33:22,860 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=1423452.0, ans=0.0 2023-06-25 21:33:54,143 INFO [train.py:996] (3/4) Epoch 8, batch 23800, loss[loss=0.229, simple_loss=0.312, pruned_loss=0.07298, over 20643.00 frames. ], tot_loss[loss=0.2164, simple_loss=0.2942, pruned_loss=0.06935, over 4268768.03 frames. ], batch size: 607, lr: 3.66e-03, grad_scale: 16.0 2023-06-25 21:34:01,477 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=1423572.0, ans=0.0 2023-06-25 21:34:23,056 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1423632.0, ans=0.0 2023-06-25 21:35:08,033 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.992e+02 4.494e+02 6.635e+02 8.945e+02 1.790e+03, threshold=1.327e+03, percent-clipped=2.0 2023-06-25 21:35:08,668 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=1423752.0, ans=0.0 2023-06-25 21:35:12,243 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=1423752.0, ans=0.0 2023-06-25 21:35:12,341 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=1423752.0, ans=0.0 2023-06-25 21:35:39,181 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.41 vs. 
limit=15.0 2023-06-25 21:35:50,939 INFO [train.py:996] (3/4) Epoch 8, batch 23850, loss[loss=0.2161, simple_loss=0.2789, pruned_loss=0.07662, over 20034.00 frames. ], tot_loss[loss=0.2223, simple_loss=0.302, pruned_loss=0.07131, over 4265424.24 frames. ], batch size: 702, lr: 3.66e-03, grad_scale: 16.0 2023-06-25 21:36:20,671 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1423932.0, ans=0.1 2023-06-25 21:37:32,673 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=1424112.0, ans=0.015 2023-06-25 21:37:40,722 INFO [train.py:996] (3/4) Epoch 8, batch 23900, loss[loss=0.2469, simple_loss=0.3245, pruned_loss=0.08468, over 21751.00 frames. ], tot_loss[loss=0.2283, simple_loss=0.3087, pruned_loss=0.07395, over 4273192.17 frames. ], batch size: 351, lr: 3.66e-03, grad_scale: 16.0 2023-06-25 21:38:00,115 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1424232.0, ans=0.1 2023-06-25 21:38:00,808 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=5.88 vs. limit=15.0 2023-06-25 21:38:09,492 INFO [scaling.py:962] (3/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=4.02 vs. limit=5.0 2023-06-25 21:38:33,270 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=1424292.0, ans=0.125 2023-06-25 21:38:41,571 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.117e+02 4.954e+02 6.480e+02 8.834e+02 1.664e+03, threshold=1.296e+03, percent-clipped=3.0 2023-06-25 21:39:00,800 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=1424412.0, ans=0.015 2023-06-25 21:39:18,421 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=1424412.0, ans=0.025 2023-06-25 21:39:19,985 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1424412.0, ans=0.125 2023-06-25 21:39:23,102 INFO [train.py:996] (3/4) Epoch 8, batch 23950, loss[loss=0.2292, simple_loss=0.2899, pruned_loss=0.08423, over 15008.00 frames. ], tot_loss[loss=0.2255, simple_loss=0.3036, pruned_loss=0.07366, over 4270610.89 frames. ], batch size: 60, lr: 3.66e-03, grad_scale: 16.0 2023-06-25 21:40:06,119 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=1424532.0, ans=0.125 2023-06-25 21:40:28,235 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 21:41:11,170 INFO [train.py:996] (3/4) Epoch 8, batch 24000, loss[loss=0.249, simple_loss=0.3194, pruned_loss=0.08929, over 21396.00 frames. ], tot_loss[loss=0.2277, simple_loss=0.3038, pruned_loss=0.07575, over 4273208.93 frames. ], batch size: 549, lr: 3.66e-03, grad_scale: 32.0 2023-06-25 21:41:11,170 INFO [train.py:1019] (3/4) Computing validation loss 2023-06-25 21:41:29,306 INFO [train.py:1028] (3/4) Epoch 8, validation: loss=0.2655, simple_loss=0.3581, pruned_loss=0.0864, over 1796401.00 frames. 
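[editor's note] The "tot_loss[..., over N frames.]" and "validation: loss=..., over 1796401.00 frames." figures logged above are frame-weighted averages accumulated across batches. The following is a minimal sketch of how such a tracker could be maintained; LossTracker and its fields are hypothetical names for illustration only (not icefall's actual classes), and the numbers in the usage example are taken from the validation entry directly above purely to show the shape of the computation.

from dataclasses import dataclass, field

@dataclass
class LossTracker:
    # Accumulated (per-frame loss * number of frames) for each tracked quantity.
    sums: dict = field(default_factory=dict)
    frames: float = 0.0

    def update(self, batch_losses: dict, batch_frames: float) -> None:
        """Fold one batch's per-frame losses into the running totals, weighted by frames."""
        for name, value in batch_losses.items():
            self.sums[name] = self.sums.get(name, 0.0) + value * batch_frames
        self.frames += batch_frames

    def averages(self) -> dict:
        """Frame-weighted averages, i.e. the kind of numbers printed as tot_loss in the log."""
        return {name: total / max(self.frames, 1.0) for name, total in self.sums.items()}

# Usage example (values copied from the validation log entry above, for illustration only):
tracker = LossTracker()
tracker.update(
    {"loss": 0.2655, "simple_loss": 0.3581, "pruned_loss": 0.0864},
    batch_frames=1796401.0,
)
print(tracker.averages())  # -> {'loss': 0.2655, 'simple_loss': 0.3581, 'pruned_loss': 0.0864}

[end editor's note]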
2023-06-25 21:41:29,307 INFO [train.py:1029] (3/4) Maximum memory allocated so far is 23654MB 2023-06-25 21:42:20,067 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1424892.0, ans=0.125 2023-06-25 21:42:40,957 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=1424952.0, ans=0.125 2023-06-25 21:42:49,032 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.318e+02 4.591e+02 6.093e+02 8.134e+02 1.870e+03, threshold=1.219e+03, percent-clipped=5.0 2023-06-25 21:43:18,403 INFO [train.py:996] (3/4) Epoch 8, batch 24050, loss[loss=0.2141, simple_loss=0.3037, pruned_loss=0.06222, over 21895.00 frames. ], tot_loss[loss=0.2296, simple_loss=0.3062, pruned_loss=0.07651, over 4276574.17 frames. ], batch size: 316, lr: 3.66e-03, grad_scale: 16.0 2023-06-25 21:43:40,258 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=1425132.0, ans=0.0 2023-06-25 21:44:50,589 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=1425312.0, ans=0.125 2023-06-25 21:44:54,112 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.max_abs, batch_count=1425312.0, ans=10.0 2023-06-25 21:44:55,765 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1425312.0, ans=0.125 2023-06-25 21:45:07,956 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1425312.0, ans=0.1 2023-06-25 21:45:14,043 INFO [train.py:996] (3/4) Epoch 8, batch 24100, loss[loss=0.2324, simple_loss=0.3133, pruned_loss=0.07573, over 21635.00 frames. ], tot_loss[loss=0.2276, simple_loss=0.3055, pruned_loss=0.07481, over 4267760.85 frames. ], batch size: 230, lr: 3.66e-03, grad_scale: 16.0 2023-06-25 21:45:29,426 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1425372.0, ans=0.125 2023-06-25 21:46:27,130 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.209e+02 4.362e+02 5.817e+02 7.695e+02 1.790e+03, threshold=1.163e+03, percent-clipped=6.0 2023-06-25 21:46:55,104 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.38 vs. limit=15.0 2023-06-25 21:47:02,455 INFO [train.py:996] (3/4) Epoch 8, batch 24150, loss[loss=0.2394, simple_loss=0.3248, pruned_loss=0.077, over 20626.00 frames. ], tot_loss[loss=0.2293, simple_loss=0.3065, pruned_loss=0.07604, over 4276583.94 frames. ], batch size: 607, lr: 3.66e-03, grad_scale: 16.0 2023-06-25 21:48:11,117 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.33 vs. 
limit=15.0 2023-06-25 21:48:22,780 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1425852.0, ans=0.125 2023-06-25 21:48:22,871 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1425852.0, ans=0.125 2023-06-25 21:48:39,035 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=1425912.0, ans=0.0 2023-06-25 21:48:58,663 INFO [train.py:996] (3/4) Epoch 8, batch 24200, loss[loss=0.2457, simple_loss=0.3396, pruned_loss=0.07587, over 21614.00 frames. ], tot_loss[loss=0.232, simple_loss=0.3093, pruned_loss=0.07731, over 4277772.44 frames. ], batch size: 389, lr: 3.66e-03, grad_scale: 16.0 2023-06-25 21:49:18,949 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=1426032.0, ans=0.125 2023-06-25 21:49:33,089 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=1426032.0, ans=0.125 2023-06-25 21:50:02,298 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.99 vs. limit=22.5 2023-06-25 21:50:13,415 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.946e+02 4.269e+02 5.400e+02 8.843e+02 1.561e+03, threshold=1.080e+03, percent-clipped=7.0 2023-06-25 21:50:49,358 INFO [train.py:996] (3/4) Epoch 8, batch 24250, loss[loss=0.1887, simple_loss=0.2828, pruned_loss=0.04731, over 21457.00 frames. ], tot_loss[loss=0.2249, simple_loss=0.3067, pruned_loss=0.07156, over 4285825.26 frames. ], batch size: 194, lr: 3.66e-03, grad_scale: 16.0 2023-06-25 21:51:12,354 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1426272.0, ans=0.1 2023-06-25 21:51:12,416 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1426272.0, ans=0.125 2023-06-25 21:52:36,520 INFO [train.py:996] (3/4) Epoch 8, batch 24300, loss[loss=0.1844, simple_loss=0.2624, pruned_loss=0.05325, over 21773.00 frames. ], tot_loss[loss=0.2177, simple_loss=0.3014, pruned_loss=0.06702, over 4284647.37 frames. 
], batch size: 298, lr: 3.66e-03, grad_scale: 16.0 2023-06-25 21:53:04,041 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=1426632.0, ans=0.025 2023-06-25 21:53:16,405 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=1426632.0, ans=0.0 2023-06-25 21:53:30,408 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=1426692.0, ans=0.0 2023-06-25 21:53:48,818 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.598e+02 3.813e+02 5.438e+02 8.323e+02 1.746e+03, threshold=1.088e+03, percent-clipped=13.0 2023-06-25 21:53:49,532 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=1426752.0, ans=0.125 2023-06-25 21:54:14,411 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1426812.0, ans=0.125 2023-06-25 21:54:29,441 INFO [train.py:996] (3/4) Epoch 8, batch 24350, loss[loss=0.2372, simple_loss=0.3042, pruned_loss=0.08507, over 21549.00 frames. ], tot_loss[loss=0.2163, simple_loss=0.2981, pruned_loss=0.06722, over 4286982.89 frames. ], batch size: 548, lr: 3.66e-03, grad_scale: 16.0 2023-06-25 21:54:33,531 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 21:55:35,678 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1427052.0, ans=0.1 2023-06-25 21:56:18,898 INFO [train.py:996] (3/4) Epoch 8, batch 24400, loss[loss=0.2477, simple_loss=0.3204, pruned_loss=0.08745, over 21604.00 frames. ], tot_loss[loss=0.2216, simple_loss=0.3015, pruned_loss=0.07086, over 4291317.73 frames. ], batch size: 441, lr: 3.66e-03, grad_scale: 16.0 2023-06-25 21:56:58,609 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=1427292.0, ans=0.125 2023-06-25 21:57:02,215 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=1427292.0, ans=0.0 2023-06-25 21:57:02,731 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.06 vs. limit=15.0 2023-06-25 21:57:02,732 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2.whitening_limit, batch_count=1427292.0, ans=15.0 2023-06-25 21:57:29,771 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.945e+02 4.612e+02 5.722e+02 8.222e+02 2.006e+03, threshold=1.144e+03, percent-clipped=13.0 2023-06-25 21:57:47,672 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer_ff3.min_abs, batch_count=1427412.0, ans=0.2 2023-06-25 21:58:07,720 INFO [train.py:996] (3/4) Epoch 8, batch 24450, loss[loss=0.2882, simple_loss=0.3819, pruned_loss=0.09731, over 21445.00 frames. ], tot_loss[loss=0.2232, simple_loss=0.3025, pruned_loss=0.07197, over 4280895.73 frames. 
], batch size: 471, lr: 3.66e-03, grad_scale: 16.0 2023-06-25 21:58:17,156 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1427472.0, ans=0.1 2023-06-25 21:59:55,386 INFO [train.py:996] (3/4) Epoch 8, batch 24500, loss[loss=0.2836, simple_loss=0.3332, pruned_loss=0.117, over 21731.00 frames. ], tot_loss[loss=0.2237, simple_loss=0.3031, pruned_loss=0.07215, over 4289132.72 frames. ], batch size: 508, lr: 3.66e-03, grad_scale: 16.0 2023-06-25 22:00:45,726 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=1427892.0, ans=0.0 2023-06-25 22:01:04,620 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.043e+02 4.093e+02 5.380e+02 7.688e+02 2.312e+03, threshold=1.076e+03, percent-clipped=10.0 2023-06-25 22:01:47,724 INFO [train.py:996] (3/4) Epoch 8, batch 24550, loss[loss=0.2278, simple_loss=0.2963, pruned_loss=0.07968, over 21575.00 frames. ], tot_loss[loss=0.2261, simple_loss=0.3046, pruned_loss=0.07379, over 4286831.87 frames. ], batch size: 263, lr: 3.66e-03, grad_scale: 16.0 2023-06-25 22:02:13,921 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=1428132.0, ans=0.2 2023-06-25 22:02:39,100 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=1428192.0, ans=0.2 2023-06-25 22:02:40,672 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1428192.0, ans=0.1 2023-06-25 22:02:59,184 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1428252.0, ans=0.1 2023-06-25 22:03:11,354 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=1428312.0, ans=0.0 2023-06-25 22:03:14,497 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=1428312.0, ans=0.2 2023-06-25 22:03:34,847 INFO [train.py:996] (3/4) Epoch 8, batch 24600, loss[loss=0.1868, simple_loss=0.2556, pruned_loss=0.059, over 21743.00 frames. ], tot_loss[loss=0.2234, simple_loss=0.2991, pruned_loss=0.07385, over 4289321.39 frames. ], batch size: 124, lr: 3.66e-03, grad_scale: 16.0 2023-06-25 22:04:43,324 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.133e+02 4.316e+02 5.425e+02 7.027e+02 1.651e+03, threshold=1.085e+03, percent-clipped=8.0 2023-06-25 22:04:44,276 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1428552.0, ans=0.1 2023-06-25 22:05:21,814 INFO [train.py:996] (3/4) Epoch 8, batch 24650, loss[loss=0.1875, simple_loss=0.2494, pruned_loss=0.06281, over 21280.00 frames. ], tot_loss[loss=0.2188, simple_loss=0.2924, pruned_loss=0.07256, over 4277816.76 frames. ], batch size: 160, lr: 3.65e-03, grad_scale: 16.0 2023-06-25 22:05:53,789 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.min_positive, batch_count=1428732.0, ans=0.05 2023-06-25 22:06:00,845 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=8.04 vs. 
limit=15.0 2023-06-25 22:06:34,577 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1428852.0, ans=0.1 2023-06-25 22:06:58,996 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.33 vs. limit=6.0 2023-06-25 22:07:07,941 INFO [train.py:996] (3/4) Epoch 8, batch 24700, loss[loss=0.2035, simple_loss=0.2774, pruned_loss=0.06479, over 21591.00 frames. ], tot_loss[loss=0.216, simple_loss=0.2903, pruned_loss=0.0709, over 4276224.19 frames. ], batch size: 332, lr: 3.65e-03, grad_scale: 16.0 2023-06-25 22:07:29,754 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-25 22:08:17,095 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.664e+02 4.405e+02 6.289e+02 8.929e+02 2.025e+03, threshold=1.258e+03, percent-clipped=12.0 2023-06-25 22:08:49,436 INFO [train.py:996] (3/4) Epoch 8, batch 24750, loss[loss=0.1822, simple_loss=0.2439, pruned_loss=0.06025, over 21440.00 frames. ], tot_loss[loss=0.2103, simple_loss=0.2843, pruned_loss=0.06816, over 4274704.07 frames. ], batch size: 212, lr: 3.65e-03, grad_scale: 16.0 2023-06-25 22:08:59,609 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.40 vs. limit=15.0 2023-06-25 22:09:09,345 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=1429332.0, ans=0.2 2023-06-25 22:09:19,410 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1429332.0, ans=0.1 2023-06-25 22:09:27,316 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1429332.0, ans=0.125 2023-06-25 22:09:56,303 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=1429452.0, ans=0.125 2023-06-25 22:10:37,875 INFO [train.py:996] (3/4) Epoch 8, batch 24800, loss[loss=0.1673, simple_loss=0.2208, pruned_loss=0.05685, over 20767.00 frames. ], tot_loss[loss=0.2083, simple_loss=0.2795, pruned_loss=0.06854, over 4279077.95 frames. ], batch size: 609, lr: 3.65e-03, grad_scale: 32.0 2023-06-25 22:10:40,797 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.62 vs. limit=15.0 2023-06-25 22:10:43,756 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1429572.0, ans=0.1 2023-06-25 22:11:22,623 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1429692.0, ans=0.1 2023-06-25 22:11:49,238 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.829e+02 4.217e+02 5.954e+02 8.314e+02 1.595e+03, threshold=1.191e+03, percent-clipped=9.0 2023-06-25 22:12:20,369 INFO [train.py:996] (3/4) Epoch 8, batch 24850, loss[loss=0.2075, simple_loss=0.2865, pruned_loss=0.06424, over 21820.00 frames. ], tot_loss[loss=0.2103, simple_loss=0.2805, pruned_loss=0.07006, over 4276917.22 frames. 
], batch size: 316, lr: 3.65e-03, grad_scale: 16.0 2023-06-25 22:12:38,422 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1429932.0, ans=0.125 2023-06-25 22:12:43,704 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=1429932.0, ans=0.04949747468305833 2023-06-25 22:12:59,902 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1429992.0, ans=0.125 2023-06-25 22:13:03,497 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1429992.0, ans=0.0 2023-06-25 22:13:47,887 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=9.54 vs. limit=15.0 2023-06-25 22:13:52,963 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1430112.0, ans=0.0 2023-06-25 22:14:08,940 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=1430172.0, ans=0.0 2023-06-25 22:14:09,930 INFO [train.py:996] (3/4) Epoch 8, batch 24900, loss[loss=0.2474, simple_loss=0.3179, pruned_loss=0.08842, over 21492.00 frames. ], tot_loss[loss=0.2134, simple_loss=0.2842, pruned_loss=0.07127, over 4273574.11 frames. ], batch size: 194, lr: 3.65e-03, grad_scale: 16.0 2023-06-25 22:14:27,744 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=1430232.0, ans=0.2 2023-06-25 22:15:04,474 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.21 vs. limit=12.0 2023-06-25 22:15:31,582 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.069e+02 4.057e+02 5.546e+02 7.694e+02 2.051e+03, threshold=1.109e+03, percent-clipped=6.0 2023-06-25 22:15:40,685 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=9.36 vs. limit=10.0 2023-06-25 22:15:46,579 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1430412.0, ans=0.0 2023-06-25 22:15:47,210 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.54 vs. limit=15.0 2023-06-25 22:15:58,275 INFO [train.py:996] (3/4) Epoch 8, batch 24950, loss[loss=0.2546, simple_loss=0.3353, pruned_loss=0.08699, over 21522.00 frames. ], tot_loss[loss=0.2214, simple_loss=0.2928, pruned_loss=0.07506, over 4279951.77 frames. ], batch size: 414, lr: 3.65e-03, grad_scale: 16.0 2023-06-25 22:16:03,166 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.19 vs. 
limit=22.5 2023-06-25 22:16:17,474 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1430472.0, ans=0.125 2023-06-25 22:16:29,655 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=1430532.0, ans=0.2 2023-06-25 22:16:29,704 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1430532.0, ans=0.125 2023-06-25 22:17:02,587 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1430592.0, ans=0.125 2023-06-25 22:17:17,019 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1430652.0, ans=0.125 2023-06-25 22:17:19,095 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.79 vs. limit=22.5 2023-06-25 22:17:35,559 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=1430712.0, ans=0.025 2023-06-25 22:17:42,406 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=1430712.0, ans=0.2 2023-06-25 22:17:46,841 INFO [train.py:996] (3/4) Epoch 8, batch 25000, loss[loss=0.2149, simple_loss=0.2904, pruned_loss=0.06976, over 21525.00 frames. ], tot_loss[loss=0.2249, simple_loss=0.2981, pruned_loss=0.0758, over 4281872.61 frames. ], batch size: 389, lr: 3.65e-03, grad_scale: 16.0 2023-06-25 22:19:07,607 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.298e+02 4.363e+02 6.743e+02 9.687e+02 1.962e+03, threshold=1.349e+03, percent-clipped=15.0 2023-06-25 22:19:32,748 INFO [train.py:996] (3/4) Epoch 8, batch 25050, loss[loss=0.2268, simple_loss=0.2686, pruned_loss=0.0925, over 21453.00 frames. ], tot_loss[loss=0.2211, simple_loss=0.2922, pruned_loss=0.07501, over 4268214.67 frames. ], batch size: 510, lr: 3.65e-03, grad_scale: 16.0 2023-06-25 22:19:49,042 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=1431072.0, ans=10.0 2023-06-25 22:20:30,700 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.86 vs. limit=22.5 2023-06-25 22:21:08,386 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1431312.0, ans=0.125 2023-06-25 22:21:13,841 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1431312.0, ans=0.0 2023-06-25 22:21:19,820 INFO [train.py:996] (3/4) Epoch 8, batch 25100, loss[loss=0.1983, simple_loss=0.2784, pruned_loss=0.05913, over 21574.00 frames. ], tot_loss[loss=0.2164, simple_loss=0.2863, pruned_loss=0.07332, over 4276381.08 frames. 
], batch size: 195, lr: 3.65e-03, grad_scale: 16.0 2023-06-25 22:21:41,853 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=1431432.0, ans=0.0 2023-06-25 22:21:58,777 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1431432.0, ans=0.1 2023-06-25 22:22:26,705 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.84 vs. limit=12.0 2023-06-25 22:22:38,869 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=1431552.0, ans=0.2 2023-06-25 22:22:41,588 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.039e+02 4.362e+02 5.445e+02 8.840e+02 1.769e+03, threshold=1.089e+03, percent-clipped=5.0 2023-06-25 22:22:52,808 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=1431612.0, ans=0.09899494936611666 2023-06-25 22:23:07,156 INFO [train.py:996] (3/4) Epoch 8, batch 25150, loss[loss=0.2258, simple_loss=0.3087, pruned_loss=0.07147, over 21398.00 frames. ], tot_loss[loss=0.2165, simple_loss=0.2906, pruned_loss=0.07117, over 4267940.19 frames. ], batch size: 211, lr: 3.65e-03, grad_scale: 16.0 2023-06-25 22:23:09,201 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1431672.0, ans=0.1 2023-06-25 22:23:32,336 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1431732.0, ans=0.125 2023-06-25 22:24:14,545 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=1431792.0, ans=0.2 2023-06-25 22:24:42,329 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1431912.0, ans=0.125 2023-06-25 22:24:55,095 INFO [train.py:996] (3/4) Epoch 8, batch 25200, loss[loss=0.2159, simple_loss=0.3119, pruned_loss=0.05998, over 21846.00 frames. ], tot_loss[loss=0.2145, simple_loss=0.2903, pruned_loss=0.06934, over 4267235.25 frames. ], batch size: 371, lr: 3.65e-03, grad_scale: 32.0 2023-06-25 22:25:11,845 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.04 vs. limit=15.0 2023-06-25 22:25:43,651 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=15.71 vs. limit=22.5 2023-06-25 22:26:14,353 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=1432152.0, ans=0.0 2023-06-25 22:26:18,383 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.791e+02 3.750e+02 5.347e+02 7.396e+02 1.859e+03, threshold=1.069e+03, percent-clipped=8.0 2023-06-25 22:26:35,741 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=1432212.0, ans=0.07 2023-06-25 22:26:41,760 INFO [train.py:996] (3/4) Epoch 8, batch 25250, loss[loss=0.2078, simple_loss=0.2685, pruned_loss=0.07354, over 21366.00 frames. ], tot_loss[loss=0.2123, simple_loss=0.2875, pruned_loss=0.06858, over 4255475.84 frames. 
], batch size: 144, lr: 3.65e-03, grad_scale: 16.0 2023-06-25 22:26:42,614 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1432272.0, ans=0.1 2023-06-25 22:27:02,263 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1432272.0, ans=0.125 2023-06-25 22:27:14,331 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=1432332.0, ans=0.0 2023-06-25 22:27:16,560 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.04 vs. limit=6.0 2023-06-25 22:27:32,850 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=1432392.0, ans=0.2 2023-06-25 22:28:29,121 INFO [train.py:996] (3/4) Epoch 8, batch 25300, loss[loss=0.2504, simple_loss=0.3348, pruned_loss=0.08298, over 21346.00 frames. ], tot_loss[loss=0.2111, simple_loss=0.2861, pruned_loss=0.06805, over 4256044.14 frames. ], batch size: 131, lr: 3.65e-03, grad_scale: 16.0 2023-06-25 22:28:37,136 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1432572.0, ans=0.1 2023-06-25 22:29:24,854 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1432692.0, ans=0.125 2023-06-25 22:29:30,762 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.82 vs. limit=15.0 2023-06-25 22:29:35,385 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1432692.0, ans=0.125 2023-06-25 22:29:39,005 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.84 vs. limit=22.5 2023-06-25 22:29:44,085 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=1432752.0, ans=0.95 2023-06-25 22:29:53,832 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.880e+02 4.048e+02 5.397e+02 7.800e+02 1.560e+03, threshold=1.079e+03, percent-clipped=8.0 2023-06-25 22:29:54,620 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 22:30:01,028 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=1432812.0, ans=0.2 2023-06-25 22:30:17,477 INFO [train.py:996] (3/4) Epoch 8, batch 25350, loss[loss=0.1813, simple_loss=0.2756, pruned_loss=0.04356, over 21731.00 frames. ], tot_loss[loss=0.2126, simple_loss=0.2889, pruned_loss=0.06811, over 4248000.71 frames. 
], batch size: 351, lr: 3.65e-03, grad_scale: 16.0 2023-06-25 22:31:03,754 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1432932.0, ans=0.0 2023-06-25 22:31:24,596 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1433052.0, ans=0.0 2023-06-25 22:31:56,406 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1433112.0, ans=0.0 2023-06-25 22:31:59,552 INFO [train.py:996] (3/4) Epoch 8, batch 25400, loss[loss=0.2257, simple_loss=0.298, pruned_loss=0.07672, over 21599.00 frames. ], tot_loss[loss=0.2086, simple_loss=0.2841, pruned_loss=0.06655, over 4251976.35 frames. ], batch size: 389, lr: 3.65e-03, grad_scale: 16.0 2023-06-25 22:33:08,520 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1433292.0, ans=0.125 2023-06-25 22:33:21,726 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.049e+02 4.073e+02 6.227e+02 9.020e+02 1.627e+03, threshold=1.245e+03, percent-clipped=13.0 2023-06-25 22:33:45,989 INFO [train.py:996] (3/4) Epoch 8, batch 25450, loss[loss=0.2075, simple_loss=0.3008, pruned_loss=0.05714, over 21324.00 frames. ], tot_loss[loss=0.2107, simple_loss=0.2843, pruned_loss=0.0685, over 4247276.91 frames. ], batch size: 548, lr: 3.65e-03, grad_scale: 16.0 2023-06-25 22:33:53,602 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=1433472.0, ans=0.0 2023-06-25 22:33:55,536 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=1433472.0, ans=0.05 2023-06-25 22:34:12,749 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.35 vs. limit=22.5 2023-06-25 22:34:38,979 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=1433592.0, ans=0.0 2023-06-25 22:34:46,307 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1433592.0, ans=0.0 2023-06-25 22:35:18,292 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.45 vs. limit=22.5 2023-06-25 22:35:31,612 INFO [train.py:996] (3/4) Epoch 8, batch 25500, loss[loss=0.2064, simple_loss=0.2962, pruned_loss=0.05828, over 21713.00 frames. ], tot_loss[loss=0.207, simple_loss=0.2843, pruned_loss=0.06487, over 4242496.91 frames. ], batch size: 298, lr: 3.65e-03, grad_scale: 16.0 2023-06-25 22:36:36,044 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=1433892.0, ans=0.2 2023-06-25 22:36:43,252 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=1433892.0, ans=0.0 2023-06-25 22:36:56,876 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.763e+02 3.870e+02 4.829e+02 7.230e+02 1.638e+03, threshold=9.659e+02, percent-clipped=1.0 2023-06-25 22:37:21,625 INFO [train.py:996] (3/4) Epoch 8, batch 25550, loss[loss=0.1914, simple_loss=0.2962, pruned_loss=0.04328, over 21629.00 frames. ], tot_loss[loss=0.2107, simple_loss=0.2909, pruned_loss=0.0652, over 4237843.67 frames. 
], batch size: 230, lr: 3.65e-03, grad_scale: 16.0 2023-06-25 22:38:15,296 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.12 vs. limit=12.0 2023-06-25 22:38:25,425 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.88 vs. limit=15.0 2023-06-25 22:38:33,908 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1434252.0, ans=0.125 2023-06-25 22:38:40,530 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=1434252.0, ans=0.125 2023-06-25 22:38:44,186 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=1434252.0, ans=0.04949747468305833 2023-06-25 22:38:44,202 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1434252.0, ans=0.125 2023-06-25 22:38:52,562 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1434312.0, ans=0.0 2023-06-25 22:39:20,112 INFO [train.py:996] (3/4) Epoch 8, batch 25600, loss[loss=0.2327, simple_loss=0.3087, pruned_loss=0.07834, over 21782.00 frames. ], tot_loss[loss=0.2129, simple_loss=0.2942, pruned_loss=0.06585, over 4235509.67 frames. ], batch size: 332, lr: 3.65e-03, grad_scale: 32.0 2023-06-25 22:39:31,287 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=1434372.0, ans=0.0 2023-06-25 22:40:18,867 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=1434492.0, ans=0.0 2023-06-25 22:40:22,348 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1434552.0, ans=0.125 2023-06-25 22:40:29,494 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1434552.0, ans=0.125 2023-06-25 22:40:31,971 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.098e+02 4.217e+02 6.682e+02 9.360e+02 1.950e+03, threshold=1.336e+03, percent-clipped=22.0 2023-06-25 22:41:11,568 INFO [train.py:996] (3/4) Epoch 8, batch 25650, loss[loss=0.2114, simple_loss=0.2778, pruned_loss=0.0725, over 21825.00 frames. ], tot_loss[loss=0.2148, simple_loss=0.2945, pruned_loss=0.06751, over 4237096.68 frames. ], batch size: 107, lr: 3.65e-03, grad_scale: 32.0 2023-06-25 22:41:13,849 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1434672.0, ans=0.0 2023-06-25 22:41:45,348 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.68 vs. limit=12.0 2023-06-25 22:41:54,737 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=1434792.0, ans=0.125 2023-06-25 22:42:58,607 INFO [train.py:996] (3/4) Epoch 8, batch 25700, loss[loss=0.197, simple_loss=0.2757, pruned_loss=0.05912, over 21787.00 frames. ], tot_loss[loss=0.2136, simple_loss=0.2908, pruned_loss=0.06823, over 4240602.00 frames. 
], batch size: 282, lr: 3.65e-03, grad_scale: 32.0 2023-06-25 22:43:15,047 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1434972.0, ans=0.1 2023-06-25 22:43:26,215 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=14.53 vs. limit=22.5 2023-06-25 22:43:39,155 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=1435092.0, ans=0.04949747468305833 2023-06-25 22:44:03,354 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1435152.0, ans=0.0 2023-06-25 22:44:06,642 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.951e+02 3.970e+02 5.194e+02 7.142e+02 1.504e+03, threshold=1.039e+03, percent-clipped=1.0 2023-06-25 22:44:51,873 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1435272.0, ans=0.125 2023-06-25 22:44:52,935 INFO [train.py:996] (3/4) Epoch 8, batch 25750, loss[loss=0.2916, simple_loss=0.3494, pruned_loss=0.1169, over 21420.00 frames. ], tot_loss[loss=0.2195, simple_loss=0.2959, pruned_loss=0.07151, over 4250275.10 frames. ], batch size: 471, lr: 3.65e-03, grad_scale: 32.0 2023-06-25 22:44:55,581 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1435272.0, ans=0.1 2023-06-25 22:45:03,514 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1435272.0, ans=0.0 2023-06-25 22:46:01,385 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1435452.0, ans=0.125 2023-06-25 22:46:33,127 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=1435512.0, ans=0.125 2023-06-25 22:46:33,200 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=1435512.0, ans=0.125 2023-06-25 22:46:45,426 INFO [train.py:996] (3/4) Epoch 8, batch 25800, loss[loss=0.2626, simple_loss=0.3304, pruned_loss=0.09742, over 21582.00 frames. ], tot_loss[loss=0.2299, simple_loss=0.3077, pruned_loss=0.0761, over 4255249.90 frames. ], batch size: 263, lr: 3.65e-03, grad_scale: 16.0 2023-06-25 22:47:45,691 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=1435692.0, ans=0.09899494936611666 2023-06-25 22:48:11,405 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.235e+02 4.952e+02 6.520e+02 9.122e+02 2.118e+03, threshold=1.304e+03, percent-clipped=17.0 2023-06-25 22:48:29,147 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=1435812.0, ans=0.0 2023-06-25 22:48:33,930 INFO [train.py:996] (3/4) Epoch 8, batch 25850, loss[loss=0.2106, simple_loss=0.2835, pruned_loss=0.06882, over 21452.00 frames. ], tot_loss[loss=0.2312, simple_loss=0.3095, pruned_loss=0.07639, over 4265680.71 frames. ], batch size: 131, lr: 3.65e-03, grad_scale: 16.0 2023-06-25 22:48:43,768 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.44 vs. 
limit=15.0 2023-06-25 22:48:47,127 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1435872.0, ans=0.1 2023-06-25 22:49:27,927 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=8.91 vs. limit=22.5 2023-06-25 22:50:03,010 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=1436052.0, ans=0.07 2023-06-25 22:50:23,325 INFO [train.py:996] (3/4) Epoch 8, batch 25900, loss[loss=0.2353, simple_loss=0.3276, pruned_loss=0.07151, over 21571.00 frames. ], tot_loss[loss=0.2324, simple_loss=0.3107, pruned_loss=0.07708, over 4274957.22 frames. ], batch size: 230, lr: 3.65e-03, grad_scale: 16.0 2023-06-25 22:50:53,896 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=1436232.0, ans=0.125 2023-06-25 22:51:16,684 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=15.50 vs. limit=15.0 2023-06-25 22:51:43,653 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.598e+02 5.216e+02 8.298e+02 1.003e+03 1.891e+03, threshold=1.660e+03, percent-clipped=7.0 2023-06-25 22:52:06,580 INFO [train.py:996] (3/4) Epoch 8, batch 25950, loss[loss=0.2433, simple_loss=0.3223, pruned_loss=0.08218, over 21583.00 frames. ], tot_loss[loss=0.2381, simple_loss=0.3171, pruned_loss=0.07957, over 4274142.62 frames. ], batch size: 389, lr: 3.64e-03, grad_scale: 16.0 2023-06-25 22:52:08,817 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1436472.0, ans=0.0 2023-06-25 22:52:09,297 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=12.30 vs. limit=15.0 2023-06-25 22:52:46,967 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.81 vs. limit=22.5 2023-06-25 22:52:48,444 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1436532.0, ans=0.2 2023-06-25 22:53:36,888 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.08 vs. limit=15.0 2023-06-25 22:53:55,598 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=1436712.0, ans=0.2 2023-06-25 22:53:58,751 INFO [train.py:996] (3/4) Epoch 8, batch 26000, loss[loss=0.2619, simple_loss=0.3345, pruned_loss=0.09466, over 21731.00 frames. ], tot_loss[loss=0.2359, simple_loss=0.3162, pruned_loss=0.0778, over 4273017.52 frames. 
], batch size: 441, lr: 3.64e-03, grad_scale: 32.0 2023-06-25 22:54:42,237 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=1436832.0, ans=0.2 2023-06-25 22:55:19,207 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1436952.0, ans=0.0 2023-06-25 22:55:20,390 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.951e+02 4.119e+02 5.246e+02 6.904e+02 1.299e+03, threshold=1.049e+03, percent-clipped=0.0 2023-06-25 22:55:39,989 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1437012.0, ans=0.0 2023-06-25 22:55:41,687 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=1437012.0, ans=0.0 2023-06-25 22:55:46,795 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1437072.0, ans=0.1 2023-06-25 22:55:47,865 INFO [train.py:996] (3/4) Epoch 8, batch 26050, loss[loss=0.21, simple_loss=0.2847, pruned_loss=0.06767, over 21861.00 frames. ], tot_loss[loss=0.2362, simple_loss=0.3151, pruned_loss=0.07865, over 4271134.13 frames. ], batch size: 118, lr: 3.64e-03, grad_scale: 32.0 2023-06-25 22:55:55,956 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1437072.0, ans=0.0 2023-06-25 22:56:07,819 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1437072.0, ans=0.125 2023-06-25 22:57:28,591 INFO [train.py:996] (3/4) Epoch 8, batch 26100, loss[loss=0.2536, simple_loss=0.3052, pruned_loss=0.101, over 21823.00 frames. ], tot_loss[loss=0.2333, simple_loss=0.3103, pruned_loss=0.07816, over 4278592.18 frames. ], batch size: 508, lr: 3.64e-03, grad_scale: 32.0 2023-06-25 22:58:22,156 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1437492.0, ans=0.125 2023-06-25 22:58:24,132 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1437492.0, ans=0.125 2023-06-25 22:58:44,089 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.195e+02 4.438e+02 5.651e+02 7.112e+02 1.480e+03, threshold=1.130e+03, percent-clipped=4.0 2023-06-25 22:59:22,553 INFO [train.py:996] (3/4) Epoch 8, batch 26150, loss[loss=0.2302, simple_loss=0.302, pruned_loss=0.07914, over 21658.00 frames. ], tot_loss[loss=0.2314, simple_loss=0.3079, pruned_loss=0.07746, over 4279563.20 frames. 
], batch size: 230, lr: 3.64e-03, grad_scale: 16.0 2023-06-25 22:59:28,765 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1437672.0, ans=0.125 2023-06-25 23:00:09,530 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 23:00:37,668 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=1437852.0, ans=0.2 2023-06-25 23:00:42,749 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1437912.0, ans=0.1 2023-06-25 23:01:12,295 INFO [train.py:996] (3/4) Epoch 8, batch 26200, loss[loss=0.2176, simple_loss=0.3063, pruned_loss=0.06444, over 21253.00 frames. ], tot_loss[loss=0.2293, simple_loss=0.3083, pruned_loss=0.07517, over 4282068.79 frames. ], batch size: 159, lr: 3.64e-03, grad_scale: 16.0 2023-06-25 23:01:16,740 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1437972.0, ans=0.1 2023-06-25 23:01:43,940 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1438032.0, ans=0.125 2023-06-25 23:01:52,772 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=8.83 vs. limit=15.0 2023-06-25 23:02:23,567 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.227e+02 4.470e+02 5.888e+02 8.750e+02 1.495e+03, threshold=1.178e+03, percent-clipped=8.0 2023-06-25 23:02:47,635 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=1438212.0, ans=0.2 2023-06-25 23:02:55,427 INFO [train.py:996] (3/4) Epoch 8, batch 26250, loss[loss=0.2493, simple_loss=0.3275, pruned_loss=0.08555, over 21589.00 frames. ], tot_loss[loss=0.2311, simple_loss=0.3128, pruned_loss=0.07464, over 4287554.61 frames. ], batch size: 471, lr: 3.64e-03, grad_scale: 16.0 2023-06-25 23:03:16,763 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=1438332.0, ans=0.125 2023-06-25 23:03:57,570 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=1438452.0, ans=0.07 2023-06-25 23:04:00,982 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1438452.0, ans=0.1 2023-06-25 23:04:36,263 INFO [train.py:996] (3/4) Epoch 8, batch 26300, loss[loss=0.2439, simple_loss=0.3138, pruned_loss=0.08696, over 21335.00 frames. ], tot_loss[loss=0.2301, simple_loss=0.3096, pruned_loss=0.07525, over 4292706.20 frames. 
], batch size: 143, lr: 3.64e-03, grad_scale: 16.0 2023-06-25 23:05:01,538 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1438632.0, ans=0.0 2023-06-25 23:05:34,991 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 23:06:03,912 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.391e+02 4.218e+02 5.396e+02 7.440e+02 1.508e+03, threshold=1.079e+03, percent-clipped=2.0 2023-06-25 23:06:08,048 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1438812.0, ans=0.0 2023-06-25 23:06:20,146 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1438812.0, ans=0.1 2023-06-25 23:06:24,586 INFO [train.py:996] (3/4) Epoch 8, batch 26350, loss[loss=0.2863, simple_loss=0.3489, pruned_loss=0.1119, over 21437.00 frames. ], tot_loss[loss=0.2312, simple_loss=0.3089, pruned_loss=0.07677, over 4293466.91 frames. ], batch size: 471, lr: 3.64e-03, grad_scale: 16.0 2023-06-25 23:07:09,607 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1438992.0, ans=0.125 2023-06-25 23:07:12,986 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1438992.0, ans=0.1 2023-06-25 23:07:33,215 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1439052.0, ans=0.125 2023-06-25 23:07:49,330 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1439052.0, ans=0.125 2023-06-25 23:08:11,326 INFO [train.py:996] (3/4) Epoch 8, batch 26400, loss[loss=0.2054, simple_loss=0.2685, pruned_loss=0.07118, over 21779.00 frames. ], tot_loss[loss=0.2273, simple_loss=0.3024, pruned_loss=0.07605, over 4289719.02 frames. ], batch size: 317, lr: 3.64e-03, grad_scale: 32.0 2023-06-25 23:08:31,710 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.36 vs. limit=15.0 2023-06-25 23:08:50,687 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=1439292.0, ans=0.125 2023-06-25 23:09:22,091 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1439352.0, ans=0.125 2023-06-25 23:09:29,683 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer_ff3.min_abs, batch_count=1439352.0, ans=0.2 2023-06-25 23:09:31,453 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=1439352.0, ans=0.5 2023-06-25 23:09:36,046 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.120e+02 4.025e+02 5.044e+02 7.451e+02 1.741e+03, threshold=1.009e+03, percent-clipped=9.0 2023-06-25 23:09:57,664 INFO [train.py:996] (3/4) Epoch 8, batch 26450, loss[loss=0.2281, simple_loss=0.3319, pruned_loss=0.06219, over 21674.00 frames. ], tot_loss[loss=0.225, simple_loss=0.2998, pruned_loss=0.07505, over 4283298.51 frames. 
], batch size: 298, lr: 3.64e-03, grad_scale: 32.0 2023-06-25 23:10:29,076 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1439532.0, ans=0.125 2023-06-25 23:11:10,626 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1439652.0, ans=0.125 2023-06-25 23:11:48,703 INFO [train.py:996] (3/4) Epoch 8, batch 26500, loss[loss=0.2112, simple_loss=0.3002, pruned_loss=0.06109, over 21833.00 frames. ], tot_loss[loss=0.2258, simple_loss=0.3031, pruned_loss=0.07423, over 4278198.63 frames. ], batch size: 316, lr: 3.64e-03, grad_scale: 16.0 2023-06-25 23:12:38,312 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=1439832.0, ans=0.0 2023-06-25 23:12:56,946 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1439892.0, ans=0.0 2023-06-25 23:13:04,188 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer_na.min_abs, batch_count=1439952.0, ans=0.02 2023-06-25 23:13:23,151 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.879e+02 4.514e+02 6.896e+02 1.400e+03 2.768e+03, threshold=1.379e+03, percent-clipped=34.0 2023-06-25 23:13:32,953 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=1440012.0, ans=0.125 2023-06-25 23:13:53,898 INFO [train.py:996] (3/4) Epoch 8, batch 26550, loss[loss=0.159, simple_loss=0.2198, pruned_loss=0.04907, over 21289.00 frames. ], tot_loss[loss=0.2221, simple_loss=0.3002, pruned_loss=0.07201, over 4270660.20 frames. ], batch size: 131, lr: 3.64e-03, grad_scale: 16.0 2023-06-25 23:14:17,873 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1440132.0, ans=0.125 2023-06-25 23:14:19,729 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1440132.0, ans=0.1 2023-06-25 23:15:02,521 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1440252.0, ans=0.0 2023-06-25 23:15:13,006 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1440312.0, ans=0.0 2023-06-25 23:15:16,296 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1440312.0, ans=0.125 2023-06-25 23:15:35,043 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1440312.0, ans=0.125 2023-06-25 23:15:47,292 INFO [train.py:996] (3/4) Epoch 8, batch 26600, loss[loss=0.1915, simple_loss=0.2648, pruned_loss=0.05912, over 21393.00 frames. ], tot_loss[loss=0.2189, simple_loss=0.2994, pruned_loss=0.06918, over 4269164.28 frames. ], batch size: 211, lr: 3.64e-03, grad_scale: 16.0 2023-06-25 23:16:14,892 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.33 vs. 
limit=15.0 2023-06-25 23:16:16,123 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=1440432.0, ans=0.125 2023-06-25 23:17:00,243 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.847e+02 4.407e+02 5.733e+02 8.512e+02 1.391e+03, threshold=1.147e+03, percent-clipped=1.0 2023-06-25 23:17:35,755 INFO [train.py:996] (3/4) Epoch 8, batch 26650, loss[loss=0.1566, simple_loss=0.2442, pruned_loss=0.03449, over 21695.00 frames. ], tot_loss[loss=0.2138, simple_loss=0.2922, pruned_loss=0.06771, over 4271001.36 frames. ], batch size: 298, lr: 3.64e-03, grad_scale: 16.0 2023-06-25 23:17:39,820 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.70 vs. limit=15.0 2023-06-25 23:18:24,080 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=1440792.0, ans=0.0 2023-06-25 23:18:44,993 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=1440852.0, ans=0.125 2023-06-25 23:19:18,179 INFO [train.py:996] (3/4) Epoch 8, batch 26700, loss[loss=0.2305, simple_loss=0.2995, pruned_loss=0.08074, over 21914.00 frames. ], tot_loss[loss=0.2096, simple_loss=0.2866, pruned_loss=0.06632, over 4262771.30 frames. ], batch size: 316, lr: 3.64e-03, grad_scale: 16.0 2023-06-25 23:19:40,117 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=1441032.0, ans=0.04949747468305833 2023-06-25 23:19:46,949 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=1441032.0, ans=0.0 2023-06-25 23:20:05,130 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.66 vs. limit=15.0 2023-06-25 23:20:09,772 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=1441092.0, ans=0.125 2023-06-25 23:20:25,854 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=1441152.0, ans=0.07 2023-06-25 23:20:37,586 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.642e+02 3.824e+02 5.567e+02 8.569e+02 1.745e+03, threshold=1.113e+03, percent-clipped=13.0 2023-06-25 23:21:01,560 INFO [train.py:996] (3/4) Epoch 8, batch 26750, loss[loss=0.2778, simple_loss=0.3503, pruned_loss=0.1026, over 21410.00 frames. ], tot_loss[loss=0.2081, simple_loss=0.2863, pruned_loss=0.06495, over 4268999.87 frames. ], batch size: 471, lr: 3.64e-03, grad_scale: 16.0 2023-06-25 23:21:02,795 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2.whitening_limit, batch_count=1441272.0, ans=15.0 2023-06-25 23:21:16,604 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=1441272.0, ans=0.2 2023-06-25 23:21:27,337 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=9.33 vs. limit=15.0 2023-06-25 23:21:31,830 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1441332.0, ans=0.0 2023-06-25 23:21:59,586 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=9.98 vs. 
limit=15.0 2023-06-25 23:22:03,992 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=1441452.0, ans=0.0 2023-06-25 23:22:46,500 INFO [train.py:996] (3/4) Epoch 8, batch 26800, loss[loss=0.2604, simple_loss=0.3473, pruned_loss=0.08673, over 21406.00 frames. ], tot_loss[loss=0.2174, simple_loss=0.2949, pruned_loss=0.06996, over 4278869.57 frames. ], batch size: 131, lr: 3.64e-03, grad_scale: 32.0 2023-06-25 23:24:14,279 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.253e+02 4.422e+02 6.215e+02 9.798e+02 1.990e+03, threshold=1.243e+03, percent-clipped=9.0 2023-06-25 23:24:27,067 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=1441812.0, ans=0.125 2023-06-25 23:24:38,154 INFO [train.py:996] (3/4) Epoch 8, batch 26850, loss[loss=0.1995, simple_loss=0.2587, pruned_loss=0.07012, over 20628.00 frames. ], tot_loss[loss=0.2203, simple_loss=0.2962, pruned_loss=0.07221, over 4274249.10 frames. ], batch size: 607, lr: 3.64e-03, grad_scale: 32.0 2023-06-25 23:24:44,574 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.78 vs. limit=22.5 2023-06-25 23:25:23,602 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1441992.0, ans=0.1 2023-06-25 23:25:31,873 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=1441992.0, ans=0.035 2023-06-25 23:26:00,786 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1442112.0, ans=0.1 2023-06-25 23:26:02,307 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=1442112.0, ans=0.2 2023-06-25 23:26:20,004 INFO [train.py:996] (3/4) Epoch 8, batch 26900, loss[loss=0.1935, simple_loss=0.258, pruned_loss=0.0645, over 21540.00 frames. ], tot_loss[loss=0.2169, simple_loss=0.2892, pruned_loss=0.07227, over 4272830.00 frames. ], batch size: 132, lr: 3.64e-03, grad_scale: 16.0 2023-06-25 23:26:43,111 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=1442232.0, ans=0.09899494936611666 2023-06-25 23:27:23,407 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=1442352.0, ans=0.2 2023-06-25 23:27:40,233 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.068e+02 3.925e+02 6.896e+02 1.001e+03 2.184e+03, threshold=1.379e+03, percent-clipped=14.0 2023-06-25 23:28:02,731 INFO [train.py:996] (3/4) Epoch 8, batch 26950, loss[loss=0.2064, simple_loss=0.2892, pruned_loss=0.06175, over 21263.00 frames. ], tot_loss[loss=0.2155, simple_loss=0.288, pruned_loss=0.07147, over 4274186.56 frames. ], batch size: 159, lr: 3.64e-03, grad_scale: 16.0 2023-06-25 23:28:29,136 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1442532.0, ans=0.125 2023-06-25 23:29:52,141 INFO [train.py:996] (3/4) Epoch 8, batch 27000, loss[loss=0.2115, simple_loss=0.304, pruned_loss=0.05949, over 21599.00 frames. ], tot_loss[loss=0.2136, simple_loss=0.2886, pruned_loss=0.06934, over 4270573.08 frames. 
], batch size: 389, lr: 3.64e-03, grad_scale: 16.0 2023-06-25 23:29:52,141 INFO [train.py:1019] (3/4) Computing validation loss 2023-06-25 23:30:10,468 INFO [train.py:1028] (3/4) Epoch 8, validation: loss=0.2506, simple_loss=0.341, pruned_loss=0.08006, over 1796401.00 frames. 2023-06-25 23:30:10,469 INFO [train.py:1029] (3/4) Maximum memory allocated so far is 23654MB 2023-06-25 23:30:17,969 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=1442772.0, ans=0.2 2023-06-25 23:31:32,477 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.635e+02 4.043e+02 5.265e+02 7.888e+02 2.132e+03, threshold=1.053e+03, percent-clipped=7.0 2023-06-25 23:31:49,544 INFO [train.py:996] (3/4) Epoch 8, batch 27050, loss[loss=0.1906, simple_loss=0.3015, pruned_loss=0.03983, over 21586.00 frames. ], tot_loss[loss=0.2108, simple_loss=0.2899, pruned_loss=0.06587, over 4268456.62 frames. ], batch size: 263, lr: 3.64e-03, grad_scale: 16.0 2023-06-25 23:32:15,935 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1443132.0, ans=0.0 2023-06-25 23:32:26,764 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.18 vs. limit=15.0 2023-06-25 23:32:35,643 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.53 vs. limit=15.0 2023-06-25 23:33:32,436 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=1443312.0, ans=0.125 2023-06-25 23:33:38,820 INFO [train.py:996] (3/4) Epoch 8, batch 27100, loss[loss=0.2089, simple_loss=0.3045, pruned_loss=0.05662, over 20952.00 frames. ], tot_loss[loss=0.2135, simple_loss=0.2928, pruned_loss=0.0671, over 4274632.40 frames. ], batch size: 607, lr: 3.64e-03, grad_scale: 16.0 2023-06-25 23:33:39,405 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=1443372.0, ans=0.2 2023-06-25 23:34:13,907 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1443432.0, ans=0.125 2023-06-25 23:35:04,945 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1443552.0, ans=0.125 2023-06-25 23:35:08,459 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=1443552.0, ans=0.125 2023-06-25 23:35:11,010 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.112e+02 4.566e+02 6.448e+02 9.782e+02 2.509e+03, threshold=1.290e+03, percent-clipped=22.0 2023-06-25 23:35:33,802 INFO [train.py:996] (3/4) Epoch 8, batch 27150, loss[loss=0.1963, simple_loss=0.2913, pruned_loss=0.05068, over 20990.00 frames. ], tot_loss[loss=0.2219, simple_loss=0.3034, pruned_loss=0.07016, over 4269929.05 frames. 
], batch size: 607, lr: 3.64e-03, grad_scale: 16.0 2023-06-25 23:35:52,398 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1443672.0, ans=0.125 2023-06-25 23:37:26,866 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=1443972.0, ans=0.125 2023-06-25 23:37:28,318 INFO [train.py:996] (3/4) Epoch 8, batch 27200, loss[loss=0.2425, simple_loss=0.3202, pruned_loss=0.08244, over 21527.00 frames. ], tot_loss[loss=0.2284, simple_loss=0.3106, pruned_loss=0.07314, over 4276354.33 frames. ], batch size: 194, lr: 3.64e-03, grad_scale: 32.0 2023-06-25 23:37:39,108 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=1443972.0, ans=0.0 2023-06-25 23:37:44,669 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1444032.0, ans=0.1 2023-06-25 23:37:44,677 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1444032.0, ans=0.0 2023-06-25 23:37:48,251 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.min_positive, batch_count=1444032.0, ans=0.05 2023-06-25 23:38:10,982 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=1444092.0, ans=0.0 2023-06-25 23:38:20,592 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=5.68 vs. limit=22.5 2023-06-25 23:38:20,735 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.46 vs. limit=15.0 2023-06-25 23:38:35,821 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=1444152.0, ans=0.125 2023-06-25 23:38:37,517 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1444152.0, ans=0.125 2023-06-25 23:39:01,346 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.411e+02 4.854e+02 6.757e+02 9.648e+02 1.735e+03, threshold=1.351e+03, percent-clipped=9.0 2023-06-25 23:39:08,978 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1444212.0, ans=0.125 2023-06-25 23:39:18,854 INFO [train.py:996] (3/4) Epoch 8, batch 27250, loss[loss=0.2548, simple_loss=0.3243, pruned_loss=0.09262, over 21382.00 frames. ], tot_loss[loss=0.233, simple_loss=0.3132, pruned_loss=0.07635, over 4274971.60 frames. ], batch size: 176, lr: 3.64e-03, grad_scale: 16.0 2023-06-25 23:39:44,329 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-25 23:40:47,761 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=5.93 vs. limit=10.0 2023-06-25 23:40:56,118 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer_na.min_abs, batch_count=1444512.0, ans=0.02 2023-06-25 23:41:14,464 INFO [train.py:996] (3/4) Epoch 8, batch 27300, loss[loss=0.2631, simple_loss=0.352, pruned_loss=0.08713, over 21722.00 frames. 
], tot_loss[loss=0.2351, simple_loss=0.3151, pruned_loss=0.07759, over 4272906.47 frames. ], batch size: 441, lr: 3.63e-03, grad_scale: 16.0 2023-06-25 23:41:18,300 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1444572.0, ans=0.0 2023-06-25 23:41:23,942 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=1444572.0, ans=0.0 2023-06-25 23:41:25,738 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=1444572.0, ans=0.125 2023-06-25 23:41:38,821 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=7.18 vs. limit=12.0 2023-06-25 23:42:43,067 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.204e+02 4.436e+02 5.757e+02 8.260e+02 1.524e+03, threshold=1.151e+03, percent-clipped=4.0 2023-06-25 23:42:45,216 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=1444812.0, ans=0.125 2023-06-25 23:43:03,209 INFO [train.py:996] (3/4) Epoch 8, batch 27350, loss[loss=0.2766, simple_loss=0.3409, pruned_loss=0.1061, over 21527.00 frames. ], tot_loss[loss=0.2373, simple_loss=0.318, pruned_loss=0.07829, over 4275866.21 frames. ], batch size: 507, lr: 3.63e-03, grad_scale: 16.0 2023-06-25 23:44:13,188 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1445052.0, ans=0.1 2023-06-25 23:44:42,203 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1445112.0, ans=0.125 2023-06-25 23:44:50,164 INFO [train.py:996] (3/4) Epoch 8, batch 27400, loss[loss=0.2171, simple_loss=0.2836, pruned_loss=0.07525, over 21532.00 frames. ], tot_loss[loss=0.2337, simple_loss=0.3127, pruned_loss=0.07731, over 4281013.10 frames. ], batch size: 548, lr: 3.63e-03, grad_scale: 16.0 2023-06-25 23:45:37,864 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=8.55 vs. limit=15.0 2023-06-25 23:45:40,751 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1445292.0, ans=0.1 2023-06-25 23:46:10,099 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1445412.0, ans=0.125 2023-06-25 23:46:14,719 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.145e+02 3.925e+02 4.930e+02 6.414e+02 1.207e+03, threshold=9.861e+02, percent-clipped=2.0 2023-06-25 23:46:30,792 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=1445412.0, ans=0.2 2023-06-25 23:46:33,522 INFO [train.py:996] (3/4) Epoch 8, batch 27450, loss[loss=0.2215, simple_loss=0.3036, pruned_loss=0.06966, over 21768.00 frames. ], tot_loss[loss=0.2295, simple_loss=0.3071, pruned_loss=0.07598, over 4276404.94 frames. 
], batch size: 282, lr: 3.63e-03, grad_scale: 8.0 2023-06-25 23:46:34,273 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1445472.0, ans=0.0 2023-06-25 23:46:49,411 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 23:47:02,485 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1445532.0, ans=0.0 2023-06-25 23:48:18,965 INFO [train.py:996] (3/4) Epoch 8, batch 27500, loss[loss=0.2129, simple_loss=0.2864, pruned_loss=0.06972, over 21511.00 frames. ], tot_loss[loss=0.2292, simple_loss=0.3061, pruned_loss=0.07611, over 4281548.55 frames. ], batch size: 194, lr: 3.63e-03, grad_scale: 8.0 2023-06-25 23:48:28,751 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.24 vs. limit=22.5 2023-06-25 23:48:37,293 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.79 vs. limit=22.5 2023-06-25 23:49:05,447 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1445892.0, ans=0.0 2023-06-25 23:49:09,070 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1445892.0, ans=0.0 2023-06-25 23:49:15,599 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1445892.0, ans=0.125 2023-06-25 23:49:21,928 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=3.76 vs. limit=12.0 2023-06-25 23:49:42,674 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.981e+02 3.778e+02 4.835e+02 6.283e+02 1.305e+03, threshold=9.670e+02, percent-clipped=1.0 2023-06-25 23:50:01,309 INFO [train.py:996] (3/4) Epoch 8, batch 27550, loss[loss=0.1823, simple_loss=0.2512, pruned_loss=0.05667, over 21491.00 frames. ], tot_loss[loss=0.2217, simple_loss=0.2988, pruned_loss=0.07225, over 4279346.62 frames. ], batch size: 194, lr: 3.63e-03, grad_scale: 8.0 2023-06-25 23:51:02,032 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.49 vs. limit=12.0 2023-06-25 23:51:46,227 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=1446312.0, ans=0.0 2023-06-25 23:51:49,459 INFO [train.py:996] (3/4) Epoch 8, batch 27600, loss[loss=0.2006, simple_loss=0.2708, pruned_loss=0.0652, over 21365.00 frames. ], tot_loss[loss=0.2169, simple_loss=0.2917, pruned_loss=0.07112, over 4273971.03 frames. ], batch size: 194, lr: 3.63e-03, grad_scale: 16.0 2023-06-25 23:52:12,094 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=1446432.0, ans=0.035 2023-06-25 23:52:29,277 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=1446432.0, ans=0.125 2023-06-25 23:52:50,415 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.36 vs. 
limit=6.0 2023-06-25 23:53:05,159 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1446552.0, ans=0.1 2023-06-25 23:53:16,406 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.997e+02 3.759e+02 4.592e+02 6.391e+02 1.970e+03, threshold=9.184e+02, percent-clipped=8.0 2023-06-25 23:53:26,575 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1446612.0, ans=0.125 2023-06-25 23:53:32,700 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.27 vs. limit=22.5 2023-06-25 23:53:34,779 INFO [train.py:996] (3/4) Epoch 8, batch 27650, loss[loss=0.2003, simple_loss=0.2709, pruned_loss=0.06483, over 21624.00 frames. ], tot_loss[loss=0.2135, simple_loss=0.2861, pruned_loss=0.07041, over 4272752.71 frames. ], batch size: 263, lr: 3.63e-03, grad_scale: 16.0 2023-06-25 23:54:13,977 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1446732.0, ans=0.125 2023-06-25 23:54:30,156 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=8.07 vs. limit=15.0 2023-06-25 23:54:34,709 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=1446792.0, ans=0.125 2023-06-25 23:54:43,834 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1446852.0, ans=0.125 2023-06-25 23:54:59,704 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1446912.0, ans=0.125 2023-06-25 23:55:22,822 INFO [train.py:996] (3/4) Epoch 8, batch 27700, loss[loss=0.2683, simple_loss=0.3589, pruned_loss=0.08889, over 20886.00 frames. ], tot_loss[loss=0.2127, simple_loss=0.2872, pruned_loss=0.06912, over 4269776.11 frames. ], batch size: 608, lr: 3.63e-03, grad_scale: 16.0 2023-06-25 23:55:42,841 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1446972.0, ans=0.125 2023-06-25 23:56:09,931 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=8.15 vs. limit=15.0 2023-06-25 23:56:56,235 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.091e+02 3.950e+02 5.187e+02 7.067e+02 1.545e+03, threshold=1.037e+03, percent-clipped=11.0 2023-06-25 23:57:09,950 INFO [train.py:996] (3/4) Epoch 8, batch 27750, loss[loss=0.2022, simple_loss=0.2833, pruned_loss=0.0605, over 21478.00 frames. ], tot_loss[loss=0.2144, simple_loss=0.2908, pruned_loss=0.06903, over 4271341.24 frames. ], batch size: 194, lr: 3.63e-03, grad_scale: 16.0 2023-06-25 23:57:39,534 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1447332.0, ans=0.125 2023-06-25 23:58:02,392 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=5.83 vs. limit=15.0 2023-06-25 23:58:54,840 INFO [train.py:996] (3/4) Epoch 8, batch 27800, loss[loss=0.2126, simple_loss=0.281, pruned_loss=0.07211, over 21399.00 frames. 
], tot_loss[loss=0.2144, simple_loss=0.2902, pruned_loss=0.06923, over 4284668.76 frames. ], batch size: 176, lr: 3.63e-03, grad_scale: 16.0 2023-06-25 23:59:06,977 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=1447572.0, ans=0.09899494936611666 2023-06-25 23:59:33,142 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=1447632.0, ans=0.07 2023-06-25 23:59:37,022 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.32 vs. limit=15.0 2023-06-26 00:00:15,888 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-26 00:00:23,976 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.743e+02 4.274e+02 5.854e+02 7.453e+02 1.495e+03, threshold=1.171e+03, percent-clipped=16.0 2023-06-26 00:00:42,959 INFO [train.py:996] (3/4) Epoch 8, batch 27850, loss[loss=0.2166, simple_loss=0.3001, pruned_loss=0.06654, over 21824.00 frames. ], tot_loss[loss=0.2155, simple_loss=0.2902, pruned_loss=0.07039, over 4287169.31 frames. ], batch size: 112, lr: 3.63e-03, grad_scale: 16.0 2023-06-26 00:00:45,682 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=1447872.0, ans=0.05 2023-06-26 00:00:46,350 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=6.49 vs. limit=15.0 2023-06-26 00:01:20,103 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1447932.0, ans=0.125 2023-06-26 00:01:30,889 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1447932.0, ans=0.0 2023-06-26 00:02:23,667 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1448112.0, ans=0.125 2023-06-26 00:02:39,164 INFO [train.py:996] (3/4) Epoch 8, batch 27900, loss[loss=0.2402, simple_loss=0.3378, pruned_loss=0.0713, over 21698.00 frames. ], tot_loss[loss=0.2228, simple_loss=0.3012, pruned_loss=0.07216, over 4289614.34 frames. ], batch size: 414, lr: 3.63e-03, grad_scale: 16.0 2023-06-26 00:02:43,352 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1448172.0, ans=0.125 2023-06-26 00:03:28,886 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.86 vs. limit=6.0 2023-06-26 00:04:08,166 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.75 vs. limit=10.0 2023-06-26 00:04:15,996 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.747e+02 3.981e+02 4.843e+02 6.105e+02 1.501e+03, threshold=9.685e+02, percent-clipped=1.0 2023-06-26 00:04:35,186 INFO [train.py:996] (3/4) Epoch 8, batch 27950, loss[loss=0.2352, simple_loss=0.3316, pruned_loss=0.06941, over 21605.00 frames. ], tot_loss[loss=0.2194, simple_loss=0.3008, pruned_loss=0.06899, over 4288874.98 frames. 
], batch size: 414, lr: 3.63e-03, grad_scale: 16.0 2023-06-26 00:04:35,792 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=1448472.0, ans=0.04949747468305833 2023-06-26 00:05:05,577 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1448532.0, ans=0.0 2023-06-26 00:06:22,306 INFO [train.py:996] (3/4) Epoch 8, batch 28000, loss[loss=0.2184, simple_loss=0.2878, pruned_loss=0.07447, over 21476.00 frames. ], tot_loss[loss=0.216, simple_loss=0.2981, pruned_loss=0.06694, over 4294189.79 frames. ], batch size: 194, lr: 3.63e-03, grad_scale: 32.0 2023-06-26 00:06:48,619 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1448832.0, ans=0.1 2023-06-26 00:07:20,611 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.01 vs. limit=22.5 2023-06-26 00:07:58,624 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.068e+02 4.485e+02 6.487e+02 9.458e+02 1.758e+03, threshold=1.297e+03, percent-clipped=21.0 2023-06-26 00:08:04,884 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.40 vs. limit=15.0 2023-06-26 00:08:10,947 INFO [train.py:996] (3/4) Epoch 8, batch 28050, loss[loss=0.2247, simple_loss=0.2859, pruned_loss=0.08177, over 20007.00 frames. ], tot_loss[loss=0.2162, simple_loss=0.2953, pruned_loss=0.06859, over 4289619.88 frames. ], batch size: 702, lr: 3.63e-03, grad_scale: 16.0 2023-06-26 00:08:20,017 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.min_positive, batch_count=1449072.0, ans=0.05 2023-06-26 00:08:20,064 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1449072.0, ans=0.0 2023-06-26 00:08:27,280 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=1449132.0, ans=0.07 2023-06-26 00:09:29,392 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1449252.0, ans=0.1 2023-06-26 00:09:31,063 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=1449252.0, ans=0.025 2023-06-26 00:09:31,531 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.02 vs. limit=12.0 2023-06-26 00:09:43,134 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1449312.0, ans=0.0 2023-06-26 00:09:57,865 INFO [train.py:996] (3/4) Epoch 8, batch 28100, loss[loss=0.1901, simple_loss=0.2644, pruned_loss=0.05785, over 21159.00 frames. ], tot_loss[loss=0.2145, simple_loss=0.2919, pruned_loss=0.06851, over 4285860.62 frames. ], batch size: 548, lr: 3.63e-03, grad_scale: 16.0 2023-06-26 00:10:01,035 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=5.94 vs. 
limit=15.0 2023-06-26 00:11:09,068 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=1449552.0, ans=0.125 2023-06-26 00:11:27,650 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.094e+02 4.530e+02 6.783e+02 9.812e+02 2.062e+03, threshold=1.357e+03, percent-clipped=16.0 2023-06-26 00:11:33,639 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=1449612.0, ans=0.0 2023-06-26 00:11:33,760 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=1449612.0, ans=0.05 2023-06-26 00:11:39,985 INFO [train.py:996] (3/4) Epoch 8, batch 28150, loss[loss=0.2045, simple_loss=0.2676, pruned_loss=0.07067, over 21580.00 frames. ], tot_loss[loss=0.211, simple_loss=0.2859, pruned_loss=0.068, over 4282315.03 frames. ], batch size: 415, lr: 3.63e-03, grad_scale: 16.0 2023-06-26 00:11:50,676 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=1449672.0, ans=0.125 2023-06-26 00:12:03,206 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1449732.0, ans=0.0 2023-06-26 00:12:13,159 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=1449732.0, ans=0.2 2023-06-26 00:13:09,894 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1449912.0, ans=0.0 2023-06-26 00:13:26,689 INFO [train.py:996] (3/4) Epoch 8, batch 28200, loss[loss=0.2208, simple_loss=0.2868, pruned_loss=0.07742, over 21396.00 frames. ], tot_loss[loss=0.2119, simple_loss=0.284, pruned_loss=0.06986, over 4282449.59 frames. ], batch size: 211, lr: 3.63e-03, grad_scale: 16.0 2023-06-26 00:14:00,801 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1450032.0, ans=0.125 2023-06-26 00:14:01,515 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=10.65 vs. limit=15.0 2023-06-26 00:14:44,786 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.37 vs. limit=10.0 2023-06-26 00:15:01,588 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1450212.0, ans=0.125 2023-06-26 00:15:02,515 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.442e+02 4.547e+02 5.710e+02 8.432e+02 1.923e+03, threshold=1.142e+03, percent-clipped=7.0 2023-06-26 00:15:14,894 INFO [train.py:996] (3/4) Epoch 8, batch 28250, loss[loss=0.1954, simple_loss=0.2732, pruned_loss=0.05878, over 16570.00 frames. ], tot_loss[loss=0.216, simple_loss=0.2888, pruned_loss=0.07163, over 4263868.58 frames. ], batch size: 60, lr: 3.63e-03, grad_scale: 16.0 2023-06-26 00:15:29,633 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1450272.0, ans=0.125 2023-06-26 00:17:04,116 INFO [train.py:996] (3/4) Epoch 8, batch 28300, loss[loss=0.2033, simple_loss=0.2757, pruned_loss=0.06542, over 21258.00 frames. ], tot_loss[loss=0.2145, simple_loss=0.2886, pruned_loss=0.07017, over 4258120.70 frames. 
], batch size: 160, lr: 3.63e-03, grad_scale: 16.0 2023-06-26 00:17:27,529 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.86 vs. limit=15.0 2023-06-26 00:17:44,374 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1450632.0, ans=0.125 2023-06-26 00:18:37,993 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1450812.0, ans=0.125 2023-06-26 00:18:39,089 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.826e+02 4.235e+02 6.929e+02 1.082e+03 2.013e+03, threshold=1.386e+03, percent-clipped=23.0 2023-06-26 00:18:56,470 INFO [train.py:996] (3/4) Epoch 8, batch 28350, loss[loss=0.1981, simple_loss=0.2985, pruned_loss=0.04888, over 21746.00 frames. ], tot_loss[loss=0.2075, simple_loss=0.2857, pruned_loss=0.06466, over 4266896.63 frames. ], batch size: 332, lr: 3.63e-03, grad_scale: 16.0 2023-06-26 00:19:05,498 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1450872.0, ans=0.1 2023-06-26 00:19:45,471 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=1450992.0, ans=0.0 2023-06-26 00:20:06,270 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=1451052.0, ans=0.2 2023-06-26 00:20:43,835 INFO [train.py:996] (3/4) Epoch 8, batch 28400, loss[loss=0.2282, simple_loss=0.296, pruned_loss=0.08023, over 21198.00 frames. ], tot_loss[loss=0.2063, simple_loss=0.2829, pruned_loss=0.06486, over 4264814.18 frames. ], batch size: 143, lr: 3.63e-03, grad_scale: 32.0 2023-06-26 00:20:45,088 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.52 vs. limit=15.0 2023-06-26 00:21:35,583 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1451292.0, ans=0.1 2023-06-26 00:21:38,946 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1451292.0, ans=0.125 2023-06-26 00:21:52,309 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1451352.0, ans=0.125 2023-06-26 00:22:03,093 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.22 vs. limit=10.0 2023-06-26 00:22:20,868 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.356e+02 4.435e+02 6.673e+02 8.870e+02 1.776e+03, threshold=1.335e+03, percent-clipped=3.0 2023-06-26 00:22:31,531 INFO [train.py:996] (3/4) Epoch 8, batch 28450, loss[loss=0.2216, simple_loss=0.2879, pruned_loss=0.07767, over 21638.00 frames. ], tot_loss[loss=0.2128, simple_loss=0.2886, pruned_loss=0.06845, over 4259539.02 frames. 
], batch size: 230, lr: 3.63e-03, grad_scale: 16.0 2023-06-26 00:22:50,942 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1451472.0, ans=0.1 2023-06-26 00:22:54,670 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=1451472.0, ans=0.035 2023-06-26 00:23:01,523 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1451532.0, ans=0.0 2023-06-26 00:24:04,927 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=1451712.0, ans=0.0 2023-06-26 00:24:30,328 INFO [train.py:996] (3/4) Epoch 8, batch 28500, loss[loss=0.223, simple_loss=0.291, pruned_loss=0.07753, over 21922.00 frames. ], tot_loss[loss=0.215, simple_loss=0.2897, pruned_loss=0.0701, over 4265813.66 frames. ], batch size: 351, lr: 3.63e-03, grad_scale: 16.0 2023-06-26 00:24:41,945 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=1451772.0, ans=0.125 2023-06-26 00:24:52,291 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=1451832.0, ans=0.0 2023-06-26 00:24:59,392 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=1451832.0, ans=0.125 2023-06-26 00:25:09,834 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1451892.0, ans=0.1 2023-06-26 00:25:17,362 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1451892.0, ans=0.125 2023-06-26 00:26:09,285 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.427e+02 4.818e+02 6.676e+02 8.470e+02 2.134e+03, threshold=1.335e+03, percent-clipped=3.0 2023-06-26 00:26:12,102 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.35 vs. limit=22.5 2023-06-26 00:26:19,576 INFO [train.py:996] (3/4) Epoch 8, batch 28550, loss[loss=0.3241, simple_loss=0.4089, pruned_loss=0.1197, over 21532.00 frames. ], tot_loss[loss=0.222, simple_loss=0.2981, pruned_loss=0.07291, over 4270893.30 frames. ], batch size: 471, lr: 3.63e-03, grad_scale: 16.0 2023-06-26 00:26:34,809 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1452072.0, ans=0.0 2023-06-26 00:27:47,500 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-26 00:28:15,688 INFO [train.py:996] (3/4) Epoch 8, batch 28600, loss[loss=0.2458, simple_loss=0.3274, pruned_loss=0.08211, over 21366.00 frames. ], tot_loss[loss=0.2264, simple_loss=0.304, pruned_loss=0.07436, over 4274667.89 frames. ], batch size: 131, lr: 3.62e-03, grad_scale: 16.0 2023-06-26 00:29:34,368 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1452552.0, ans=0.1 2023-06-26 00:29:40,074 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=12.58 vs. 
limit=15.0 2023-06-26 00:29:53,321 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.158e+02 4.451e+02 5.957e+02 7.529e+02 1.462e+03, threshold=1.191e+03, percent-clipped=3.0 2023-06-26 00:30:03,845 INFO [train.py:996] (3/4) Epoch 8, batch 28650, loss[loss=0.2021, simple_loss=0.2638, pruned_loss=0.07023, over 21208.00 frames. ], tot_loss[loss=0.2237, simple_loss=0.2988, pruned_loss=0.07428, over 4270332.77 frames. ], batch size: 176, lr: 3.62e-03, grad_scale: 16.0 2023-06-26 00:30:09,702 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1452672.0, ans=0.1 2023-06-26 00:30:16,701 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1452672.0, ans=0.0 2023-06-26 00:30:39,340 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=1452732.0, ans=0.2 2023-06-26 00:30:46,123 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.min_positive, batch_count=1452792.0, ans=0.05 2023-06-26 00:31:47,839 INFO [train.py:996] (3/4) Epoch 8, batch 28700, loss[loss=0.2097, simple_loss=0.2901, pruned_loss=0.06468, over 21635.00 frames. ], tot_loss[loss=0.2247, simple_loss=0.2983, pruned_loss=0.07556, over 4267695.64 frames. ], batch size: 263, lr: 3.62e-03, grad_scale: 16.0 2023-06-26 00:32:15,833 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1453032.0, ans=0.1 2023-06-26 00:32:29,845 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1453092.0, ans=0.0 2023-06-26 00:32:30,304 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.01 vs. limit=15.0 2023-06-26 00:33:19,798 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.264e+02 4.611e+02 5.755e+02 7.778e+02 1.501e+03, threshold=1.151e+03, percent-clipped=4.0 2023-06-26 00:33:30,172 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.89 vs. limit=15.0 2023-06-26 00:33:30,631 INFO [train.py:996] (3/4) Epoch 8, batch 28750, loss[loss=0.2215, simple_loss=0.3052, pruned_loss=0.06886, over 21819.00 frames. ], tot_loss[loss=0.2245, simple_loss=0.2978, pruned_loss=0.07558, over 4269573.58 frames. ], batch size: 414, lr: 3.62e-03, grad_scale: 16.0 2023-06-26 00:33:33,031 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1453272.0, ans=0.1 2023-06-26 00:33:38,528 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1453272.0, ans=0.1 2023-06-26 00:33:45,039 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1453272.0, ans=0.125 2023-06-26 00:33:46,628 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1453332.0, ans=0.1 2023-06-26 00:33:49,494 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=8.37 vs. 
limit=15.0 2023-06-26 00:34:05,895 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=1453332.0, ans=0.0 2023-06-26 00:34:06,601 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=11.60 vs. limit=15.0 2023-06-26 00:34:23,484 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=1453392.0, ans=0.0 2023-06-26 00:34:43,525 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1453452.0, ans=0.125 2023-06-26 00:35:07,759 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1453512.0, ans=0.125 2023-06-26 00:35:18,762 INFO [train.py:996] (3/4) Epoch 8, batch 28800, loss[loss=0.23, simple_loss=0.3083, pruned_loss=0.07583, over 21629.00 frames. ], tot_loss[loss=0.2272, simple_loss=0.3017, pruned_loss=0.07639, over 4274614.54 frames. ], batch size: 263, lr: 3.62e-03, grad_scale: 32.0 2023-06-26 00:35:44,997 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=1453632.0, ans=0.2 2023-06-26 00:36:55,660 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.079e+02 4.504e+02 5.803e+02 7.798e+02 1.715e+03, threshold=1.161e+03, percent-clipped=9.0 2023-06-26 00:37:06,134 INFO [train.py:996] (3/4) Epoch 8, batch 28850, loss[loss=0.2375, simple_loss=0.301, pruned_loss=0.08697, over 21361.00 frames. ], tot_loss[loss=0.2281, simple_loss=0.3018, pruned_loss=0.07723, over 4283299.75 frames. ], batch size: 159, lr: 3.62e-03, grad_scale: 32.0 2023-06-26 00:37:12,608 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.08 vs. limit=15.0 2023-06-26 00:37:38,879 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=1453932.0, ans=0.2 2023-06-26 00:37:58,572 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1453992.0, ans=0.1 2023-06-26 00:38:26,706 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1454052.0, ans=0.125 2023-06-26 00:38:27,151 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.09 vs. limit=15.0 2023-06-26 00:38:56,924 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=1454172.0, ans=0.125 2023-06-26 00:39:02,755 INFO [train.py:996] (3/4) Epoch 8, batch 28900, loss[loss=0.3054, simple_loss=0.3694, pruned_loss=0.1207, over 21499.00 frames. ], tot_loss[loss=0.232, simple_loss=0.3056, pruned_loss=0.07922, over 4282495.83 frames. ], batch size: 508, lr: 3.62e-03, grad_scale: 32.0 2023-06-26 00:40:36,716 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.510e+02 4.525e+02 6.150e+02 8.317e+02 2.231e+03, threshold=1.230e+03, percent-clipped=10.0 2023-06-26 00:40:57,584 INFO [train.py:996] (3/4) Epoch 8, batch 28950, loss[loss=0.1979, simple_loss=0.2678, pruned_loss=0.06397, over 21624.00 frames. ], tot_loss[loss=0.2312, simple_loss=0.3062, pruned_loss=0.07814, over 4278001.96 frames. 
], batch size: 230, lr: 3.62e-03, grad_scale: 32.0 2023-06-26 00:41:39,999 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1454532.0, ans=0.125 2023-06-26 00:42:03,381 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1454652.0, ans=0.1 2023-06-26 00:42:08,965 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1454652.0, ans=0.125 2023-06-26 00:42:36,164 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1454712.0, ans=0.1 2023-06-26 00:42:52,294 INFO [train.py:996] (3/4) Epoch 8, batch 29000, loss[loss=0.2577, simple_loss=0.3282, pruned_loss=0.09355, over 21349.00 frames. ], tot_loss[loss=0.2322, simple_loss=0.3091, pruned_loss=0.07768, over 4275437.76 frames. ], batch size: 549, lr: 3.62e-03, grad_scale: 16.0 2023-06-26 00:42:58,221 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=1454772.0, ans=0.125 2023-06-26 00:42:59,607 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1454772.0, ans=0.125 2023-06-26 00:43:42,083 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1454892.0, ans=0.0 2023-06-26 00:43:46,687 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=1454892.0, ans=0.0 2023-06-26 00:44:25,389 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.229e+02 4.694e+02 5.564e+02 8.456e+02 2.061e+03, threshold=1.113e+03, percent-clipped=6.0 2023-06-26 00:44:39,539 INFO [train.py:996] (3/4) Epoch 8, batch 29050, loss[loss=0.23, simple_loss=0.299, pruned_loss=0.08054, over 21821.00 frames. ], tot_loss[loss=0.2314, simple_loss=0.3069, pruned_loss=0.07795, over 4280984.50 frames. ], batch size: 441, lr: 3.62e-03, grad_scale: 16.0 2023-06-26 00:45:27,772 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.87 vs. limit=10.0 2023-06-26 00:45:46,154 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=1455252.0, ans=0.0 2023-06-26 00:46:27,346 INFO [train.py:996] (3/4) Epoch 8, batch 29100, loss[loss=0.2104, simple_loss=0.2627, pruned_loss=0.07903, over 21320.00 frames. ], tot_loss[loss=0.2244, simple_loss=0.2978, pruned_loss=0.07545, over 4280461.86 frames. ], batch size: 473, lr: 3.62e-03, grad_scale: 16.0 2023-06-26 00:46:45,320 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=1455372.0, ans=0.0 2023-06-26 00:47:45,133 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=11.43 vs. 
limit=15.0 2023-06-26 00:47:58,170 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1455612.0, ans=0.1 2023-06-26 00:48:06,925 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.913e+02 4.309e+02 6.273e+02 8.461e+02 1.678e+03, threshold=1.255e+03, percent-clipped=7.0 2023-06-26 00:48:15,359 INFO [train.py:996] (3/4) Epoch 8, batch 29150, loss[loss=0.2235, simple_loss=0.2921, pruned_loss=0.07743, over 20009.00 frames. ], tot_loss[loss=0.2224, simple_loss=0.2972, pruned_loss=0.07378, over 4276749.80 frames. ], batch size: 702, lr: 3.62e-03, grad_scale: 16.0 2023-06-26 00:49:37,051 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1455852.0, ans=0.0 2023-06-26 00:50:08,237 INFO [train.py:996] (3/4) Epoch 8, batch 29200, loss[loss=0.1926, simple_loss=0.2663, pruned_loss=0.05948, over 21224.00 frames. ], tot_loss[loss=0.2199, simple_loss=0.2931, pruned_loss=0.07331, over 4274200.39 frames. ], batch size: 159, lr: 3.62e-03, grad_scale: 32.0 2023-06-26 00:50:25,501 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten.whitening_limit, batch_count=1456032.0, ans=22.5 2023-06-26 00:50:42,268 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1456092.0, ans=0.125 2023-06-26 00:51:15,258 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1456152.0, ans=0.1 2023-06-26 00:51:42,012 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.239e+02 4.282e+02 5.514e+02 8.024e+02 1.461e+03, threshold=1.103e+03, percent-clipped=3.0 2023-06-26 00:51:55,657 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1456272.0, ans=0.125 2023-06-26 00:51:56,519 INFO [train.py:996] (3/4) Epoch 8, batch 29250, loss[loss=0.2015, simple_loss=0.2908, pruned_loss=0.05608, over 21576.00 frames. ], tot_loss[loss=0.217, simple_loss=0.2921, pruned_loss=0.07091, over 4275053.91 frames. ], batch size: 230, lr: 3.62e-03, grad_scale: 32.0 2023-06-26 00:52:08,060 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=1456272.0, ans=0.125 2023-06-26 00:52:13,459 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=1456332.0, ans=0.125 2023-06-26 00:53:12,325 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=1456452.0, ans=0.2 2023-06-26 00:53:19,153 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=1456452.0, ans=0.2 2023-06-26 00:53:22,629 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1456512.0, ans=0.1 2023-06-26 00:53:24,186 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1456512.0, ans=0.1 2023-06-26 00:53:44,020 INFO [train.py:996] (3/4) Epoch 8, batch 29300, loss[loss=0.2141, simple_loss=0.2994, pruned_loss=0.06443, over 21241.00 frames. ], tot_loss[loss=0.2173, simple_loss=0.2937, pruned_loss=0.07042, over 4269872.75 frames. 
], batch size: 176, lr: 3.62e-03, grad_scale: 16.0 2023-06-26 00:54:21,628 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.28 vs. limit=22.5 2023-06-26 00:54:22,901 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1456692.0, ans=0.1 2023-06-26 00:54:53,183 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1456752.0, ans=0.1 2023-06-26 00:55:25,848 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.817e+02 4.100e+02 5.558e+02 8.472e+02 2.092e+03, threshold=1.112e+03, percent-clipped=11.0 2023-06-26 00:55:32,606 INFO [train.py:996] (3/4) Epoch 8, batch 29350, loss[loss=0.2037, simple_loss=0.2821, pruned_loss=0.06265, over 21537.00 frames. ], tot_loss[loss=0.2145, simple_loss=0.2899, pruned_loss=0.06954, over 4267753.80 frames. ], batch size: 230, lr: 3.62e-03, grad_scale: 16.0 2023-06-26 00:56:24,978 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=7.31 vs. limit=15.0 2023-06-26 00:56:58,990 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=1457052.0, ans=0.0 2023-06-26 00:57:09,757 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=1457112.0, ans=0.5 2023-06-26 00:57:21,101 INFO [train.py:996] (3/4) Epoch 8, batch 29400, loss[loss=0.119, simple_loss=0.1576, pruned_loss=0.04018, over 16104.00 frames. ], tot_loss[loss=0.2124, simple_loss=0.2894, pruned_loss=0.06766, over 4262905.11 frames. ], batch size: 60, lr: 3.62e-03, grad_scale: 16.0 2023-06-26 00:57:28,632 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1457172.0, ans=0.1 2023-06-26 00:58:01,627 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=1457232.0, ans=0.2 2023-06-26 00:58:30,035 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.32 vs. limit=15.0 2023-06-26 00:59:02,202 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.050e+02 4.516e+02 7.158e+02 1.067e+03 2.108e+03, threshold=1.432e+03, percent-clipped=22.0 2023-06-26 00:59:02,882 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1457412.0, ans=0.125 2023-06-26 00:59:09,195 INFO [train.py:996] (3/4) Epoch 8, batch 29450, loss[loss=0.2493, simple_loss=0.3244, pruned_loss=0.08708, over 21375.00 frames. ], tot_loss[loss=0.2107, simple_loss=0.2877, pruned_loss=0.06682, over 4260985.33 frames. ], batch size: 549, lr: 3.62e-03, grad_scale: 16.0 2023-06-26 00:59:30,131 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=1457532.0, ans=0.025 2023-06-26 00:59:32,538 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.53 vs. 
limit=6.0 2023-06-26 00:59:36,594 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1457532.0, ans=0.0 2023-06-26 00:59:50,998 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=1457532.0, ans=0.07 2023-06-26 01:00:12,026 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=1457592.0, ans=0.0 2023-06-26 01:00:12,050 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1457592.0, ans=0.1 2023-06-26 01:00:56,317 INFO [train.py:996] (3/4) Epoch 8, batch 29500, loss[loss=0.223, simple_loss=0.2822, pruned_loss=0.08195, over 21257.00 frames. ], tot_loss[loss=0.2171, simple_loss=0.2932, pruned_loss=0.07054, over 4270166.76 frames. ], batch size: 159, lr: 3.62e-03, grad_scale: 16.0 2023-06-26 01:01:13,127 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=7.59 vs. limit=15.0 2023-06-26 01:02:02,232 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1457952.0, ans=0.1 2023-06-26 01:02:17,685 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=1457952.0, ans=0.09899494936611666 2023-06-26 01:02:36,142 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.293e+02 4.544e+02 5.932e+02 7.825e+02 1.489e+03, threshold=1.186e+03, percent-clipped=1.0 2023-06-26 01:02:42,877 INFO [train.py:996] (3/4) Epoch 8, batch 29550, loss[loss=0.2353, simple_loss=0.3115, pruned_loss=0.07957, over 22058.00 frames. ], tot_loss[loss=0.2179, simple_loss=0.2922, pruned_loss=0.07184, over 4281023.32 frames. ], batch size: 119, lr: 3.62e-03, grad_scale: 16.0 2023-06-26 01:02:52,374 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1458072.0, ans=0.125 2023-06-26 01:03:19,115 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1458132.0, ans=0.1 2023-06-26 01:04:40,148 INFO [train.py:996] (3/4) Epoch 8, batch 29600, loss[loss=0.2816, simple_loss=0.3646, pruned_loss=0.0993, over 21641.00 frames. ], tot_loss[loss=0.2241, simple_loss=0.2991, pruned_loss=0.07454, over 4288484.23 frames. ], batch size: 389, lr: 3.62e-03, grad_scale: 32.0 2023-06-26 01:06:01,409 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=1458552.0, ans=0.05 2023-06-26 01:06:04,525 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=1458612.0, ans=0.2 2023-06-26 01:06:21,141 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.719e+02 4.529e+02 7.554e+02 1.096e+03 2.697e+03, threshold=1.511e+03, percent-clipped=19.0 2023-06-26 01:06:27,942 INFO [train.py:996] (3/4) Epoch 8, batch 29650, loss[loss=0.1981, simple_loss=0.2693, pruned_loss=0.06348, over 21873.00 frames. ], tot_loss[loss=0.2203, simple_loss=0.2983, pruned_loss=0.07118, over 4282555.89 frames. 
], batch size: 124, lr: 3.62e-03, grad_scale: 32.0 2023-06-26 01:06:35,128 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1458672.0, ans=0.125 2023-06-26 01:07:10,750 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1458732.0, ans=0.125 2023-06-26 01:07:51,941 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1458852.0, ans=0.125 2023-06-26 01:08:17,130 INFO [train.py:996] (3/4) Epoch 8, batch 29700, loss[loss=0.2415, simple_loss=0.3219, pruned_loss=0.08056, over 21742.00 frames. ], tot_loss[loss=0.2209, simple_loss=0.2993, pruned_loss=0.07129, over 4283169.03 frames. ], batch size: 112, lr: 3.62e-03, grad_scale: 32.0 2023-06-26 01:09:18,417 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=1459092.0, ans=0.125 2023-06-26 01:09:32,410 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=1459152.0, ans=0.125 2023-06-26 01:09:57,644 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.231e+02 4.516e+02 5.860e+02 9.248e+02 1.775e+03, threshold=1.172e+03, percent-clipped=6.0 2023-06-26 01:10:04,572 INFO [train.py:996] (3/4) Epoch 8, batch 29750, loss[loss=0.2048, simple_loss=0.322, pruned_loss=0.04376, over 19737.00 frames. ], tot_loss[loss=0.2229, simple_loss=0.3043, pruned_loss=0.07076, over 4283590.79 frames. ], batch size: 702, lr: 3.62e-03, grad_scale: 32.0 2023-06-26 01:11:07,003 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1459392.0, ans=0.1 2023-06-26 01:11:16,881 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1459452.0, ans=0.125 2023-06-26 01:11:51,249 INFO [train.py:996] (3/4) Epoch 8, batch 29800, loss[loss=0.2856, simple_loss=0.3332, pruned_loss=0.119, over 21789.00 frames. ], tot_loss[loss=0.2241, simple_loss=0.3049, pruned_loss=0.07165, over 4287670.46 frames. ], batch size: 508, lr: 3.62e-03, grad_scale: 16.0 2023-06-26 01:11:51,876 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1459572.0, ans=0.125 2023-06-26 01:11:53,619 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-26 01:12:29,292 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=1459632.0, ans=0.125 2023-06-26 01:12:54,230 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=1459692.0, ans=0.0 2023-06-26 01:12:56,290 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=12.47 vs. 
limit=22.5 2023-06-26 01:13:04,349 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=1459752.0, ans=0.04949747468305833 2023-06-26 01:13:27,667 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1459812.0, ans=0.0 2023-06-26 01:13:32,344 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.767e+02 3.928e+02 4.572e+02 6.290e+02 1.025e+03, threshold=9.144e+02, percent-clipped=0.0 2023-06-26 01:13:34,831 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=1459812.0, ans=0.0 2023-06-26 01:13:37,436 INFO [train.py:996] (3/4) Epoch 8, batch 29850, loss[loss=0.1993, simple_loss=0.2735, pruned_loss=0.06259, over 21599.00 frames. ], tot_loss[loss=0.2197, simple_loss=0.3004, pruned_loss=0.06948, over 4274882.08 frames. ], batch size: 263, lr: 3.62e-03, grad_scale: 16.0 2023-06-26 01:13:44,598 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=1459872.0, ans=0.0 2023-06-26 01:14:41,450 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1460052.0, ans=0.125 2023-06-26 01:15:20,094 INFO [train.py:996] (3/4) Epoch 8, batch 29900, loss[loss=0.2504, simple_loss=0.3147, pruned_loss=0.09305, over 21664.00 frames. ], tot_loss[loss=0.2202, simple_loss=0.299, pruned_loss=0.07068, over 4282567.17 frames. ], batch size: 230, lr: 3.62e-03, grad_scale: 16.0 2023-06-26 01:16:34,536 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.45 vs. limit=15.0 2023-06-26 01:16:40,820 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1460352.0, ans=0.0 2023-06-26 01:17:10,424 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.335e+02 4.671e+02 6.480e+02 9.712e+02 1.710e+03, threshold=1.296e+03, percent-clipped=28.0 2023-06-26 01:17:15,564 INFO [train.py:996] (3/4) Epoch 8, batch 29950, loss[loss=0.2666, simple_loss=0.341, pruned_loss=0.09614, over 21284.00 frames. ], tot_loss[loss=0.225, simple_loss=0.3018, pruned_loss=0.07408, over 4277505.35 frames. ], batch size: 143, lr: 3.61e-03, grad_scale: 16.0 2023-06-26 01:17:59,033 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=1460592.0, ans=0.0 2023-06-26 01:19:00,373 INFO [train.py:996] (3/4) Epoch 8, batch 30000, loss[loss=0.1999, simple_loss=0.2933, pruned_loss=0.05326, over 21610.00 frames. ], tot_loss[loss=0.2258, simple_loss=0.3032, pruned_loss=0.07422, over 4276541.81 frames. ], batch size: 230, lr: 3.61e-03, grad_scale: 32.0 2023-06-26 01:19:00,373 INFO [train.py:1019] (3/4) Computing validation loss 2023-06-26 01:19:18,797 INFO [train.py:1028] (3/4) Epoch 8, validation: loss=0.2464, simple_loss=0.3452, pruned_loss=0.07378, over 1796401.00 frames. 2023-06-26 01:19:18,798 INFO [train.py:1029] (3/4) Maximum memory allocated so far is 23690MB 2023-06-26 01:19:58,202 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1460832.0, ans=0.1 2023-06-26 01:19:58,665 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.69 vs. 
limit=22.5 2023-06-26 01:21:14,681 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.858e+02 4.174e+02 5.657e+02 7.922e+02 1.669e+03, threshold=1.131e+03, percent-clipped=1.0 2023-06-26 01:21:20,182 INFO [train.py:996] (3/4) Epoch 8, batch 30050, loss[loss=0.2533, simple_loss=0.3599, pruned_loss=0.07332, over 21820.00 frames. ], tot_loss[loss=0.225, simple_loss=0.3064, pruned_loss=0.07181, over 4277249.10 frames. ], batch size: 371, lr: 3.61e-03, grad_scale: 32.0 2023-06-26 01:21:32,906 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1461072.0, ans=0.1 2023-06-26 01:21:34,582 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=1461072.0, ans=0.125 2023-06-26 01:21:42,214 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=1461132.0, ans=0.0 2023-06-26 01:22:24,421 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=12.05 vs. limit=15.0 2023-06-26 01:22:48,777 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=1461252.0, ans=0.2 2023-06-26 01:23:13,753 INFO [train.py:996] (3/4) Epoch 8, batch 30100, loss[loss=0.1812, simple_loss=0.2503, pruned_loss=0.05603, over 21330.00 frames. ], tot_loss[loss=0.2237, simple_loss=0.3046, pruned_loss=0.07138, over 4278404.65 frames. ], batch size: 211, lr: 3.61e-03, grad_scale: 16.0 2023-06-26 01:23:36,715 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1461432.0, ans=0.0 2023-06-26 01:24:53,941 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.090e+02 4.517e+02 6.270e+02 9.720e+02 3.054e+03, threshold=1.254e+03, percent-clipped=16.0 2023-06-26 01:24:57,503 INFO [train.py:996] (3/4) Epoch 8, batch 30150, loss[loss=0.2365, simple_loss=0.3047, pruned_loss=0.08412, over 21591.00 frames. ], tot_loss[loss=0.2228, simple_loss=0.3008, pruned_loss=0.07247, over 4267288.39 frames. ], batch size: 230, lr: 3.61e-03, grad_scale: 16.0 2023-06-26 01:25:06,937 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1461672.0, ans=0.125 2023-06-26 01:25:33,256 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1461732.0, ans=0.1 2023-06-26 01:25:35,760 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.78 vs. limit=12.0 2023-06-26 01:26:07,780 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1461852.0, ans=0.125 2023-06-26 01:26:34,749 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=1461912.0, ans=0.0 2023-06-26 01:26:53,766 INFO [train.py:996] (3/4) Epoch 8, batch 30200, loss[loss=0.25, simple_loss=0.3334, pruned_loss=0.08328, over 21767.00 frames. ], tot_loss[loss=0.2229, simple_loss=0.3022, pruned_loss=0.07176, over 4262163.23 frames. 
], batch size: 441, lr: 3.61e-03, grad_scale: 16.0 2023-06-26 01:27:03,904 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1461972.0, ans=0.1 2023-06-26 01:27:31,872 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-26 01:27:49,036 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.57 vs. limit=6.0 2023-06-26 01:28:20,198 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1462212.0, ans=0.1 2023-06-26 01:28:45,643 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.354e+02 5.048e+02 7.227e+02 1.023e+03 2.150e+03, threshold=1.445e+03, percent-clipped=15.0 2023-06-26 01:28:48,938 INFO [train.py:996] (3/4) Epoch 8, batch 30250, loss[loss=0.2233, simple_loss=0.2959, pruned_loss=0.07535, over 20118.00 frames. ], tot_loss[loss=0.2285, simple_loss=0.3095, pruned_loss=0.07373, over 4264541.09 frames. ], batch size: 702, lr: 3.61e-03, grad_scale: 16.0 2023-06-26 01:29:18,679 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.72 vs. limit=15.0 2023-06-26 01:29:48,786 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=1462392.0, ans=0.0 2023-06-26 01:29:56,688 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=13.15 vs. limit=22.5 2023-06-26 01:30:36,860 INFO [train.py:996] (3/4) Epoch 8, batch 30300, loss[loss=0.1947, simple_loss=0.2583, pruned_loss=0.06555, over 21880.00 frames. ], tot_loss[loss=0.2262, simple_loss=0.3063, pruned_loss=0.07304, over 4267412.35 frames. ], batch size: 373, lr: 3.61e-03, grad_scale: 16.0 2023-06-26 01:31:00,470 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=5.69 vs. limit=12.0 2023-06-26 01:31:08,996 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1462632.0, ans=0.1 2023-06-26 01:32:02,402 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=1462752.0, ans=0.125 2023-06-26 01:32:11,470 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1462812.0, ans=0.125 2023-06-26 01:32:31,190 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.196e+02 5.174e+02 6.761e+02 1.021e+03 2.632e+03, threshold=1.352e+03, percent-clipped=10.0 2023-06-26 01:32:34,766 INFO [train.py:996] (3/4) Epoch 8, batch 30350, loss[loss=0.2033, simple_loss=0.2779, pruned_loss=0.06431, over 21665.00 frames. ], tot_loss[loss=0.2277, simple_loss=0.3068, pruned_loss=0.07427, over 4272033.27 frames. ], batch size: 247, lr: 3.61e-03, grad_scale: 16.0 2023-06-26 01:32:44,088 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=1462872.0, ans=0.05 2023-06-26 01:32:50,376 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.69 vs. 
limit=22.5 2023-06-26 01:33:02,278 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1462932.0, ans=0.125 2023-06-26 01:33:20,101 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1462992.0, ans=0.0 2023-06-26 01:33:24,551 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer_na.min_abs, batch_count=1463052.0, ans=0.02 2023-06-26 01:33:25,867 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1463052.0, ans=0.0 2023-06-26 01:33:37,802 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1463052.0, ans=0.1 2023-06-26 01:33:45,019 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.43 vs. limit=15.0 2023-06-26 01:33:55,348 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.21 vs. limit=22.5 2023-06-26 01:33:56,336 INFO [train.py:996] (3/4) Epoch 8, batch 30400, loss[loss=0.2161, simple_loss=0.2701, pruned_loss=0.08111, over 20191.00 frames. ], tot_loss[loss=0.2242, simple_loss=0.3019, pruned_loss=0.07321, over 4258622.48 frames. ], batch size: 702, lr: 3.61e-03, grad_scale: 32.0 2023-06-26 01:34:22,909 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1463232.0, ans=0.0 2023-06-26 01:34:47,758 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=1463292.0, ans=0.09899494936611666 2023-06-26 01:34:56,813 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1463352.0, ans=0.0 2023-06-26 01:35:05,375 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=1463352.0, ans=0.0 2023-06-26 01:35:24,303 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.064e+02 6.383e+02 1.075e+03 1.632e+03 7.193e+03, threshold=2.149e+03, percent-clipped=36.0 2023-06-26 01:35:25,749 INFO [train.py:996] (3/4) Epoch 8, batch 30450, loss[loss=0.2642, simple_loss=0.3808, pruned_loss=0.07385, over 19883.00 frames. ], tot_loss[loss=0.2247, simple_loss=0.303, pruned_loss=0.07317, over 4200107.87 frames. ], batch size: 702, lr: 3.61e-03, grad_scale: 16.0 2023-06-26 01:35:37,963 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.92 vs. limit=15.0 2023-06-26 01:38:50,977 INFO [train.py:996] (3/4) Epoch 9, batch 0, loss[loss=0.2331, simple_loss=0.2882, pruned_loss=0.08901, over 21357.00 frames. ], tot_loss[loss=0.2331, simple_loss=0.2882, pruned_loss=0.08901, over 21357.00 frames. ], batch size: 473, lr: 3.39e-03, grad_scale: 32.0 2023-06-26 01:38:50,978 INFO [train.py:1019] (3/4) Computing validation loss 2023-06-26 01:39:14,232 INFO [train.py:1028] (3/4) Epoch 9, validation: loss=0.2395, simple_loss=0.3459, pruned_loss=0.06656, over 1796401.00 frames. 
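Note on reading the loss fields in these entries: each train.py line prints loss, simple_loss and pruned_loss, and the logged totals in this section are consistent with the pruned-transducer combination loss ≈ simple_loss_scale * simple_loss + pruned_loss, assuming the post-warm-up simple_loss_scale of 0.5 (the warm-up ramp itself is not visible in these entries). A minimal Python check against two validation entries taken from this log:

    # Sanity-check sketch: recombine simple_loss and pruned_loss and compare with the
    # logged total, assuming simple_loss_scale = 0.5 (post-warm-up value; assumption,
    # not read from these entries).
    logged = [
        # (loss, simple_loss, pruned_loss) as printed by train.py
        (0.2464, 0.3452, 0.07378),  # Epoch 8, batch 30000, validation
        (0.2395, 0.3459, 0.06656),  # Epoch 9, batch 0, validation
    ]

    SIMPLE_LOSS_SCALE = 0.5  # assumed

    for loss, simple_loss, pruned_loss in logged:
        recombined = SIMPLE_LOSS_SCALE * simple_loss + pruned_loss
        # logged values are rounded to ~4 significant digits, so allow a small tolerance
        assert abs(recombined - loss) < 5e-4, (loss, recombined)
        print(f"loss={loss:.4f}  0.5*simple_loss+pruned_loss={recombined:.4f}")

Both recombined values agree with the logged totals to within rounding, which is why only simple_loss and pruned_loss need to be tracked when comparing checkpoints from this run.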
2023-06-26 01:39:14,233 INFO [train.py:1029] (3/4) Maximum memory allocated so far is 23690MB 2023-06-26 01:39:27,056 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1463742.0, ans=0.125 2023-06-26 01:40:02,426 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=7.56 vs. limit=15.0 2023-06-26 01:40:07,061 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=1463862.0, ans=0.0 2023-06-26 01:40:27,441 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1463922.0, ans=0.0 2023-06-26 01:40:30,894 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=1463922.0, ans=0.05 2023-06-26 01:40:59,318 INFO [train.py:996] (3/4) Epoch 9, batch 50, loss[loss=0.2744, simple_loss=0.3513, pruned_loss=0.09873, over 21627.00 frames. ], tot_loss[loss=0.23, simple_loss=0.311, pruned_loss=0.07446, over 965733.42 frames. ], batch size: 414, lr: 3.39e-03, grad_scale: 16.0 2023-06-26 01:41:03,000 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1464042.0, ans=0.125 2023-06-26 01:41:13,453 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.197e+02 4.855e+02 1.072e+03 2.293e+03 5.497e+03, threshold=2.144e+03, percent-clipped=28.0 2023-06-26 01:41:14,730 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=14.75 vs. limit=22.5 2023-06-26 01:41:32,003 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten.whitening_limit, batch_count=1464102.0, ans=15.0 2023-06-26 01:41:37,026 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=4.53 vs. limit=15.0 2023-06-26 01:42:00,418 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=1464162.0, ans=0.2 2023-06-26 01:42:24,445 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-26 01:42:34,289 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1464282.0, ans=0.125 2023-06-26 01:42:40,006 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1464342.0, ans=0.0 2023-06-26 01:42:40,947 INFO [train.py:996] (3/4) Epoch 9, batch 100, loss[loss=0.2554, simple_loss=0.3461, pruned_loss=0.08232, over 21249.00 frames. ], tot_loss[loss=0.2414, simple_loss=0.3275, pruned_loss=0.07767, over 1707375.60 frames. ], batch size: 143, lr: 3.39e-03, grad_scale: 16.0 2023-06-26 01:43:12,935 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=1464402.0, ans=0.0 2023-06-26 01:43:34,437 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1464462.0, ans=0.0 2023-06-26 01:43:38,279 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=6.39 vs. 
limit=15.0 2023-06-26 01:43:43,505 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=8.73 vs. limit=10.0 2023-06-26 01:43:44,690 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1464462.0, ans=0.0 2023-06-26 01:43:54,923 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-26 01:44:11,730 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=1464582.0, ans=0.2 2023-06-26 01:44:26,093 INFO [train.py:996] (3/4) Epoch 9, batch 150, loss[loss=0.2541, simple_loss=0.3336, pruned_loss=0.0873, over 21859.00 frames. ], tot_loss[loss=0.2402, simple_loss=0.3271, pruned_loss=0.0767, over 2283809.99 frames. ], batch size: 118, lr: 3.39e-03, grad_scale: 16.0 2023-06-26 01:44:32,193 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.01 vs. limit=15.0 2023-06-26 01:44:40,676 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.204e+02 4.415e+02 5.834e+02 7.944e+02 1.480e+03, threshold=1.167e+03, percent-clipped=0.0 2023-06-26 01:44:50,159 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1464642.0, ans=0.0 2023-06-26 01:45:23,812 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1464762.0, ans=0.125 2023-06-26 01:45:30,171 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=1464762.0, ans=0.2 2023-06-26 01:45:40,287 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=1464822.0, ans=0.0 2023-06-26 01:45:50,465 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1464822.0, ans=0.125 2023-06-26 01:45:52,398 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.12 vs. limit=6.0 2023-06-26 01:46:00,402 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1464882.0, ans=0.0 2023-06-26 01:46:13,197 INFO [train.py:996] (3/4) Epoch 9, batch 200, loss[loss=0.2128, simple_loss=0.2847, pruned_loss=0.07047, over 21906.00 frames. ], tot_loss[loss=0.235, simple_loss=0.3209, pruned_loss=0.07454, over 2727907.36 frames. ], batch size: 98, lr: 3.38e-03, grad_scale: 16.0 2023-06-26 01:46:53,489 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=1465002.0, ans=0.95 2023-06-26 01:48:00,432 INFO [train.py:996] (3/4) Epoch 9, batch 250, loss[loss=0.2617, simple_loss=0.3359, pruned_loss=0.09377, over 21593.00 frames. ], tot_loss[loss=0.2305, simple_loss=0.3143, pruned_loss=0.07338, over 3066796.35 frames. ], batch size: 389, lr: 3.38e-03, grad_scale: 16.0 2023-06-26 01:48:08,373 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.60 vs. 
limit=15.0 2023-06-26 01:48:08,788 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.143e+02 4.378e+02 6.069e+02 8.721e+02 1.562e+03, threshold=1.214e+03, percent-clipped=10.0 2023-06-26 01:48:15,104 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1465242.0, ans=0.125 2023-06-26 01:48:31,208 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=1465302.0, ans=0.2 2023-06-26 01:49:08,965 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1465362.0, ans=0.0 2023-06-26 01:49:12,653 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=1465362.0, ans=0.2 2023-06-26 01:49:40,143 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=1465482.0, ans=0.125 2023-06-26 01:49:50,402 INFO [train.py:996] (3/4) Epoch 9, batch 300, loss[loss=0.1926, simple_loss=0.2614, pruned_loss=0.06194, over 21301.00 frames. ], tot_loss[loss=0.2265, simple_loss=0.3081, pruned_loss=0.07242, over 3333990.08 frames. ], batch size: 131, lr: 3.38e-03, grad_scale: 16.0 2023-06-26 01:50:10,704 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1465542.0, ans=0.125 2023-06-26 01:50:32,361 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=1465602.0, ans=0.0 2023-06-26 01:50:56,100 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.49 vs. limit=15.0 2023-06-26 01:51:32,665 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1465782.0, ans=0.0 2023-06-26 01:51:38,468 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=1465782.0, ans=0.2 2023-06-26 01:51:41,269 INFO [train.py:996] (3/4) Epoch 9, batch 350, loss[loss=0.23, simple_loss=0.3029, pruned_loss=0.07853, over 20771.00 frames. ], tot_loss[loss=0.2206, simple_loss=0.3015, pruned_loss=0.06985, over 3540550.15 frames. ], batch size: 609, lr: 3.38e-03, grad_scale: 16.0 2023-06-26 01:51:43,753 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=1465842.0, ans=0.125 2023-06-26 01:51:50,486 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.975e+02 4.637e+02 6.282e+02 9.202e+02 1.945e+03, threshold=1.256e+03, percent-clipped=12.0 2023-06-26 01:52:24,688 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=1465902.0, ans=0.125 2023-06-26 01:53:03,307 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1466022.0, ans=0.125 2023-06-26 01:53:27,901 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=1466082.0, ans=0.07 2023-06-26 01:53:30,986 INFO [train.py:996] (3/4) Epoch 9, batch 400, loss[loss=0.2327, simple_loss=0.286, pruned_loss=0.08966, over 21263.00 frames. ], tot_loss[loss=0.2169, simple_loss=0.2956, pruned_loss=0.06913, over 3708314.66 frames. 
], batch size: 471, lr: 3.38e-03, grad_scale: 16.0 2023-06-26 01:54:31,848 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.max_abs, batch_count=1466262.0, ans=10.0 2023-06-26 01:54:43,905 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=1466322.0, ans=0.5 2023-06-26 01:54:49,612 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1466322.0, ans=0.125 2023-06-26 01:54:51,782 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=6.35 vs. limit=10.0 2023-06-26 01:55:02,089 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.51 vs. limit=22.5 2023-06-26 01:55:14,281 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1466382.0, ans=0.125 2023-06-26 01:55:16,180 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=1466382.0, ans=0.125 2023-06-26 01:55:20,920 INFO [train.py:996] (3/4) Epoch 9, batch 450, loss[loss=0.1856, simple_loss=0.2521, pruned_loss=0.05955, over 21433.00 frames. ], tot_loss[loss=0.2155, simple_loss=0.293, pruned_loss=0.06899, over 3838389.32 frames. ], batch size: 212, lr: 3.38e-03, grad_scale: 16.0 2023-06-26 01:55:22,291 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.62 vs. limit=12.0 2023-06-26 01:55:28,360 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1466442.0, ans=0.125 2023-06-26 01:55:41,295 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.313e+02 4.889e+02 7.953e+02 1.170e+03 2.853e+03, threshold=1.591e+03, percent-clipped=21.0 2023-06-26 01:55:42,162 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1466442.0, ans=0.125 2023-06-26 01:55:42,209 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-26 01:55:56,066 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1466502.0, ans=0.125 2023-06-26 01:56:52,703 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1466682.0, ans=0.125 2023-06-26 01:57:12,939 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=1466742.0, ans=0.0 2023-06-26 01:57:13,980 INFO [train.py:996] (3/4) Epoch 9, batch 500, loss[loss=0.1765, simple_loss=0.2331, pruned_loss=0.05995, over 20815.00 frames. ], tot_loss[loss=0.2143, simple_loss=0.2923, pruned_loss=0.06812, over 3929967.05 frames. 
], batch size: 608, lr: 3.38e-03, grad_scale: 16.0 2023-06-26 01:58:14,061 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=1466862.0, ans=0.125 2023-06-26 01:58:33,160 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1466922.0, ans=0.0 2023-06-26 01:58:37,240 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1466922.0, ans=0.125 2023-06-26 01:59:07,242 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1467042.0, ans=0.1 2023-06-26 01:59:08,332 INFO [train.py:996] (3/4) Epoch 9, batch 550, loss[loss=0.2104, simple_loss=0.2875, pruned_loss=0.06662, over 21888.00 frames. ], tot_loss[loss=0.2161, simple_loss=0.2959, pruned_loss=0.0682, over 4004298.65 frames. ], batch size: 118, lr: 3.38e-03, grad_scale: 16.0 2023-06-26 01:59:25,225 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.991e+02 4.595e+02 7.824e+02 1.104e+03 2.417e+03, threshold=1.565e+03, percent-clipped=11.0 2023-06-26 02:01:03,323 INFO [train.py:996] (3/4) Epoch 9, batch 600, loss[loss=0.19, simple_loss=0.2657, pruned_loss=0.05718, over 21736.00 frames. ], tot_loss[loss=0.219, simple_loss=0.3005, pruned_loss=0.06872, over 4067289.01 frames. ], batch size: 316, lr: 3.38e-03, grad_scale: 16.0 2023-06-26 02:01:10,713 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=1467342.0, ans=0.025 2023-06-26 02:01:38,780 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1467402.0, ans=0.1 2023-06-26 02:01:49,351 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1467462.0, ans=0.125 2023-06-26 02:02:23,458 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=1467582.0, ans=0.0 2023-06-26 02:02:47,038 INFO [train.py:996] (3/4) Epoch 9, batch 650, loss[loss=0.2489, simple_loss=0.3586, pruned_loss=0.06961, over 21647.00 frames. ], tot_loss[loss=0.2224, simple_loss=0.3039, pruned_loss=0.07041, over 4109309.89 frames. ], batch size: 441, lr: 3.38e-03, grad_scale: 16.0 2023-06-26 02:03:03,585 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.173e+02 5.371e+02 7.433e+02 1.361e+03 3.228e+03, threshold=1.487e+03, percent-clipped=18.0 2023-06-26 02:03:26,589 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-26 02:03:39,390 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1467762.0, ans=0.125 2023-06-26 02:03:43,441 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1467762.0, ans=0.0 2023-06-26 02:04:15,481 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=1467882.0, ans=0.2 2023-06-26 02:04:44,101 INFO [train.py:996] (3/4) Epoch 9, batch 700, loss[loss=0.1994, simple_loss=0.2685, pruned_loss=0.06515, over 21661.00 frames. ], tot_loss[loss=0.222, simple_loss=0.302, pruned_loss=0.07102, over 4152321.04 frames. 
], batch size: 230, lr: 3.38e-03, grad_scale: 16.0 2023-06-26 02:04:48,847 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=4.88 vs. limit=15.0 2023-06-26 02:05:01,718 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.11 vs. limit=15.0 2023-06-26 02:05:43,726 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=1468122.0, ans=0.125 2023-06-26 02:06:02,379 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1468182.0, ans=0.125 2023-06-26 02:06:07,739 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1468182.0, ans=0.125 2023-06-26 02:06:31,665 INFO [train.py:996] (3/4) Epoch 9, batch 750, loss[loss=0.2571, simple_loss=0.344, pruned_loss=0.08507, over 21724.00 frames. ], tot_loss[loss=0.2213, simple_loss=0.2995, pruned_loss=0.07157, over 4192561.77 frames. ], batch size: 298, lr: 3.38e-03, grad_scale: 16.0 2023-06-26 02:06:42,116 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.730e+02 4.754e+02 6.417e+02 9.585e+02 1.882e+03, threshold=1.283e+03, percent-clipped=6.0 2023-06-26 02:07:09,182 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.63 vs. limit=15.0 2023-06-26 02:07:42,653 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten.whitening_limit, batch_count=1468422.0, ans=15.0 2023-06-26 02:07:51,095 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-26 02:08:10,189 INFO [train.py:996] (3/4) Epoch 9, batch 800, loss[loss=0.1926, simple_loss=0.2647, pruned_loss=0.0602, over 21780.00 frames. ], tot_loss[loss=0.2201, simple_loss=0.2974, pruned_loss=0.07144, over 4219659.08 frames. ], batch size: 351, lr: 3.38e-03, grad_scale: 32.0 2023-06-26 02:09:40,629 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-26 02:09:55,778 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.85 vs. limit=15.0 2023-06-26 02:10:10,647 INFO [train.py:996] (3/4) Epoch 9, batch 850, loss[loss=0.2049, simple_loss=0.2822, pruned_loss=0.06377, over 21909.00 frames. ], tot_loss[loss=0.2194, simple_loss=0.296, pruned_loss=0.07141, over 4245519.32 frames. 
], batch size: 316, lr: 3.38e-03, grad_scale: 32.0 2023-06-26 02:10:26,282 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.492e+02 5.225e+02 7.900e+02 1.161e+03 2.208e+03, threshold=1.580e+03, percent-clipped=19.0 2023-06-26 02:10:28,524 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1468842.0, ans=0.0 2023-06-26 02:10:39,408 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1468902.0, ans=0.0 2023-06-26 02:11:21,959 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1469022.0, ans=0.125 2023-06-26 02:11:24,336 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.85 vs. limit=15.0 2023-06-26 02:11:59,350 INFO [train.py:996] (3/4) Epoch 9, batch 900, loss[loss=0.211, simple_loss=0.2903, pruned_loss=0.06581, over 21825.00 frames. ], tot_loss[loss=0.2174, simple_loss=0.2927, pruned_loss=0.07101, over 4257080.66 frames. ], batch size: 282, lr: 3.38e-03, grad_scale: 16.0 2023-06-26 02:13:48,838 INFO [train.py:996] (3/4) Epoch 9, batch 950, loss[loss=0.2467, simple_loss=0.3194, pruned_loss=0.08699, over 21924.00 frames. ], tot_loss[loss=0.2162, simple_loss=0.2918, pruned_loss=0.07037, over 4266129.07 frames. ], batch size: 316, lr: 3.38e-03, grad_scale: 16.0 2023-06-26 02:13:51,140 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1469442.0, ans=0.125 2023-06-26 02:13:55,274 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.96 vs. limit=15.0 2023-06-26 02:14:01,420 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.026e+02 4.404e+02 7.084e+02 1.100e+03 2.197e+03, threshold=1.417e+03, percent-clipped=5.0 2023-06-26 02:14:27,856 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1469562.0, ans=0.0 2023-06-26 02:14:41,303 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1469622.0, ans=0.125 2023-06-26 02:14:48,008 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=1469622.0, ans=0.2 2023-06-26 02:15:04,047 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=1469682.0, ans=0.0 2023-06-26 02:15:06,387 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.67 vs. limit=12.0 2023-06-26 02:15:36,787 INFO [train.py:996] (3/4) Epoch 9, batch 1000, loss[loss=0.2036, simple_loss=0.2758, pruned_loss=0.06575, over 21286.00 frames. ], tot_loss[loss=0.2158, simple_loss=0.2915, pruned_loss=0.07003, over 4273276.18 frames. 
], batch size: 144, lr: 3.38e-03, grad_scale: 16.0 2023-06-26 02:16:02,226 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=1469802.0, ans=0.09899494936611666 2023-06-26 02:16:07,422 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1469802.0, ans=0.0 2023-06-26 02:16:30,772 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=1469922.0, ans=0.125 2023-06-26 02:17:06,258 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=1469982.0, ans=0.125 2023-06-26 02:17:22,267 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1469982.0, ans=0.125 2023-06-26 02:17:27,472 INFO [train.py:996] (3/4) Epoch 9, batch 1050, loss[loss=0.2322, simple_loss=0.3157, pruned_loss=0.07434, over 21612.00 frames. ], tot_loss[loss=0.2165, simple_loss=0.293, pruned_loss=0.07005, over 4276925.23 frames. ], batch size: 389, lr: 3.38e-03, grad_scale: 16.0 2023-06-26 02:17:39,392 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.206e+02 4.347e+02 6.082e+02 9.446e+02 2.534e+03, threshold=1.216e+03, percent-clipped=8.0 2023-06-26 02:19:18,794 INFO [train.py:996] (3/4) Epoch 9, batch 1100, loss[loss=0.2283, simple_loss=0.3008, pruned_loss=0.07792, over 21641.00 frames. ], tot_loss[loss=0.2166, simple_loss=0.2938, pruned_loss=0.06971, over 4277617.59 frames. ], batch size: 441, lr: 3.38e-03, grad_scale: 16.0 2023-06-26 02:19:28,131 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=1470342.0, ans=0.2 2023-06-26 02:19:48,358 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.30 vs. limit=6.0 2023-06-26 02:21:09,477 INFO [train.py:996] (3/4) Epoch 9, batch 1150, loss[loss=0.2315, simple_loss=0.3251, pruned_loss=0.06889, over 21659.00 frames. ], tot_loss[loss=0.2175, simple_loss=0.2951, pruned_loss=0.06996, over 4278934.05 frames. ], batch size: 389, lr: 3.38e-03, grad_scale: 16.0 2023-06-26 02:21:12,614 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1.whitening_limit, batch_count=1470642.0, ans=10.0 2023-06-26 02:21:22,269 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.381e+02 4.817e+02 6.167e+02 1.033e+03 2.052e+03, threshold=1.233e+03, percent-clipped=13.0 2023-06-26 02:21:57,723 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.50 vs. limit=15.0 2023-06-26 02:22:26,765 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=1470822.0, ans=0.035 2023-06-26 02:23:00,148 INFO [train.py:996] (3/4) Epoch 9, batch 1200, loss[loss=0.2194, simple_loss=0.2963, pruned_loss=0.07127, over 21673.00 frames. ], tot_loss[loss=0.2178, simple_loss=0.2954, pruned_loss=0.07013, over 4278142.92 frames. 
], batch size: 231, lr: 3.38e-03, grad_scale: 32.0 2023-06-26 02:23:09,907 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=1470942.0, ans=0.2 2023-06-26 02:24:52,845 INFO [train.py:996] (3/4) Epoch 9, batch 1250, loss[loss=0.1977, simple_loss=0.2623, pruned_loss=0.06654, over 21185.00 frames. ], tot_loss[loss=0.2184, simple_loss=0.2959, pruned_loss=0.07044, over 4275510.06 frames. ], batch size: 608, lr: 3.38e-03, grad_scale: 16.0 2023-06-26 02:25:06,490 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.265e+02 4.578e+02 6.578e+02 9.426e+02 2.383e+03, threshold=1.316e+03, percent-clipped=14.0 2023-06-26 02:26:43,206 INFO [train.py:996] (3/4) Epoch 9, batch 1300, loss[loss=0.1836, simple_loss=0.2696, pruned_loss=0.04878, over 21482.00 frames. ], tot_loss[loss=0.2194, simple_loss=0.2969, pruned_loss=0.07091, over 4280319.58 frames. ], batch size: 211, lr: 3.38e-03, grad_scale: 16.0 2023-06-26 02:26:49,114 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1471542.0, ans=0.125 2023-06-26 02:26:54,897 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=1471542.0, ans=0.0 2023-06-26 02:27:02,919 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=1471602.0, ans=0.0 2023-06-26 02:27:18,995 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1471602.0, ans=0.0 2023-06-26 02:27:43,189 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=1471662.0, ans=0.125 2023-06-26 02:28:32,844 INFO [train.py:996] (3/4) Epoch 9, batch 1350, loss[loss=0.2154, simple_loss=0.2877, pruned_loss=0.07159, over 21471.00 frames. ], tot_loss[loss=0.2203, simple_loss=0.2983, pruned_loss=0.07113, over 4280472.05 frames. ], batch size: 131, lr: 3.38e-03, grad_scale: 16.0 2023-06-26 02:28:46,556 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.596e+02 4.887e+02 7.409e+02 1.206e+03 1.964e+03, threshold=1.482e+03, percent-clipped=19.0 2023-06-26 02:29:09,212 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=6.26 vs. limit=15.0 2023-06-26 02:29:19,866 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1471962.0, ans=0.0 2023-06-26 02:30:18,253 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=1472082.0, ans=0.95 2023-06-26 02:30:22,908 INFO [train.py:996] (3/4) Epoch 9, batch 1400, loss[loss=0.2267, simple_loss=0.3145, pruned_loss=0.06944, over 21816.00 frames. ], tot_loss[loss=0.22, simple_loss=0.298, pruned_loss=0.07103, over 4286494.72 frames. ], batch size: 298, lr: 3.38e-03, grad_scale: 16.0 2023-06-26 02:30:59,337 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1472202.0, ans=0.0 2023-06-26 02:31:24,399 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1472262.0, ans=0.0 2023-06-26 02:32:13,541 INFO [train.py:996] (3/4) Epoch 9, batch 1450, loss[loss=0.2429, simple_loss=0.3201, pruned_loss=0.08284, over 21898.00 frames. 
], tot_loss[loss=0.2192, simple_loss=0.2966, pruned_loss=0.07091, over 4283859.33 frames. ], batch size: 316, lr: 3.38e-03, grad_scale: 16.0 2023-06-26 02:32:14,331 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=1472442.0, ans=0.2 2023-06-26 02:32:17,875 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.79 vs. limit=15.0 2023-06-26 02:32:27,124 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.492e+02 5.469e+02 8.336e+02 1.169e+03 2.052e+03, threshold=1.667e+03, percent-clipped=11.0 2023-06-26 02:33:07,903 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1472562.0, ans=0.125 2023-06-26 02:33:20,774 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=1472622.0, ans=0.2 2023-06-26 02:33:33,014 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=1472622.0, ans=0.2 2023-06-26 02:33:57,834 INFO [train.py:996] (3/4) Epoch 9, batch 1500, loss[loss=0.2267, simple_loss=0.3012, pruned_loss=0.07606, over 21672.00 frames. ], tot_loss[loss=0.2208, simple_loss=0.298, pruned_loss=0.07178, over 4277946.22 frames. ], batch size: 298, lr: 3.38e-03, grad_scale: 16.0 2023-06-26 02:34:05,519 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=1472742.0, ans=0.125 2023-06-26 02:35:12,424 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=1472922.0, ans=0.125 2023-06-26 02:35:24,329 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.60 vs. limit=15.0 2023-06-26 02:35:44,372 INFO [train.py:996] (3/4) Epoch 9, batch 1550, loss[loss=0.2107, simple_loss=0.2807, pruned_loss=0.07039, over 21637.00 frames. ], tot_loss[loss=0.2185, simple_loss=0.2959, pruned_loss=0.07051, over 4282399.41 frames. ], batch size: 263, lr: 3.38e-03, grad_scale: 16.0 2023-06-26 02:35:58,923 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.109e+02 4.360e+02 5.874e+02 7.765e+02 1.799e+03, threshold=1.175e+03, percent-clipped=2.0 2023-06-26 02:36:10,570 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1473102.0, ans=0.0 2023-06-26 02:37:13,329 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=1473222.0, ans=0.0 2023-06-26 02:37:28,929 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1473282.0, ans=0.125 2023-06-26 02:37:35,447 INFO [train.py:996] (3/4) Epoch 9, batch 1600, loss[loss=0.1874, simple_loss=0.2602, pruned_loss=0.05733, over 21817.00 frames. ], tot_loss[loss=0.2172, simple_loss=0.2945, pruned_loss=0.07, over 4287651.44 frames. 
], batch size: 282, lr: 3.38e-03, grad_scale: 32.0 2023-06-26 02:38:31,870 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1473462.0, ans=0.125 2023-06-26 02:39:00,677 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=1473522.0, ans=0.07 2023-06-26 02:39:21,334 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1473642.0, ans=0.125 2023-06-26 02:39:22,830 INFO [train.py:996] (3/4) Epoch 9, batch 1650, loss[loss=0.2156, simple_loss=0.3003, pruned_loss=0.0655, over 21779.00 frames. ], tot_loss[loss=0.2167, simple_loss=0.2938, pruned_loss=0.06986, over 4274109.84 frames. ], batch size: 282, lr: 3.37e-03, grad_scale: 16.0 2023-06-26 02:39:38,042 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=1473642.0, ans=0.05 2023-06-26 02:39:56,129 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.260e+02 4.603e+02 6.235e+02 9.034e+02 1.719e+03, threshold=1.247e+03, percent-clipped=11.0 2023-06-26 02:40:36,699 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1473822.0, ans=0.1 2023-06-26 02:40:40,917 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=1473822.0, ans=0.09899494936611666 2023-06-26 02:41:04,851 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1473882.0, ans=0.125 2023-06-26 02:41:11,342 INFO [train.py:996] (3/4) Epoch 9, batch 1700, loss[loss=0.2267, simple_loss=0.3015, pruned_loss=0.07595, over 21742.00 frames. ], tot_loss[loss=0.2188, simple_loss=0.2954, pruned_loss=0.07112, over 4281587.43 frames. ], batch size: 298, lr: 3.37e-03, grad_scale: 16.0 2023-06-26 02:43:10,708 INFO [train.py:996] (3/4) Epoch 9, batch 1750, loss[loss=0.2419, simple_loss=0.3371, pruned_loss=0.07337, over 19856.00 frames. ], tot_loss[loss=0.2187, simple_loss=0.2974, pruned_loss=0.07001, over 4280511.69 frames. ], batch size: 703, lr: 3.37e-03, grad_scale: 16.0 2023-06-26 02:43:19,105 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=1474242.0, ans=0.2 2023-06-26 02:43:26,590 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.127e+02 4.603e+02 7.165e+02 1.089e+03 2.171e+03, threshold=1.433e+03, percent-clipped=16.0 2023-06-26 02:43:34,691 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1474302.0, ans=0.0 2023-06-26 02:44:03,337 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1474362.0, ans=0.125 2023-06-26 02:44:08,145 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1474422.0, ans=0.125 2023-06-26 02:44:31,022 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1474422.0, ans=0.125 2023-06-26 02:44:59,051 INFO [train.py:996] (3/4) Epoch 9, batch 1800, loss[loss=0.2149, simple_loss=0.3149, pruned_loss=0.05747, over 21638.00 frames. 
], tot_loss[loss=0.2168, simple_loss=0.2969, pruned_loss=0.0684, over 4274103.17 frames. ], batch size: 414, lr: 3.37e-03, grad_scale: 8.0 2023-06-26 02:45:06,772 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=1474542.0, ans=0.0 2023-06-26 02:45:35,008 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1474662.0, ans=0.125 2023-06-26 02:45:40,357 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1474662.0, ans=0.1 2023-06-26 02:46:36,415 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=1474782.0, ans=0.0 2023-06-26 02:46:49,564 INFO [train.py:996] (3/4) Epoch 9, batch 1850, loss[loss=0.243, simple_loss=0.3435, pruned_loss=0.07125, over 21528.00 frames. ], tot_loss[loss=0.2172, simple_loss=0.2996, pruned_loss=0.0674, over 4279068.02 frames. ], batch size: 471, lr: 3.37e-03, grad_scale: 8.0 2023-06-26 02:47:07,111 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.277e+02 4.370e+02 7.147e+02 9.387e+02 1.947e+03, threshold=1.429e+03, percent-clipped=4.0 2023-06-26 02:47:07,896 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=1474902.0, ans=0.125 2023-06-26 02:47:20,553 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=1474902.0, ans=0.125 2023-06-26 02:47:27,714 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1474962.0, ans=0.125 2023-06-26 02:47:28,158 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.28 vs. limit=15.0 2023-06-26 02:48:27,193 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=1475082.0, ans=0.05 2023-06-26 02:48:35,239 INFO [train.py:996] (3/4) Epoch 9, batch 1900, loss[loss=0.1916, simple_loss=0.2641, pruned_loss=0.0596, over 21410.00 frames. ], tot_loss[loss=0.2171, simple_loss=0.2991, pruned_loss=0.06758, over 4274176.43 frames. ], batch size: 194, lr: 3.37e-03, grad_scale: 8.0 2023-06-26 02:48:55,169 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=1475202.0, ans=0.025 2023-06-26 02:48:58,700 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1475202.0, ans=0.125 2023-06-26 02:49:52,038 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=1475322.0, ans=0.125 2023-06-26 02:50:22,018 INFO [train.py:996] (3/4) Epoch 9, batch 1950, loss[loss=0.2129, simple_loss=0.3173, pruned_loss=0.05424, over 21651.00 frames. ], tot_loss[loss=0.2151, simple_loss=0.2959, pruned_loss=0.06721, over 4277740.30 frames. 
], batch size: 414, lr: 3.37e-03, grad_scale: 8.0 2023-06-26 02:50:39,716 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.119e+02 4.600e+02 6.101e+02 9.329e+02 1.931e+03, threshold=1.220e+03, percent-clipped=7.0 2023-06-26 02:52:00,308 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=1475682.0, ans=0.125 2023-06-26 02:52:07,662 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.26 vs. limit=22.5 2023-06-26 02:52:08,972 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=1475682.0, ans=0.0 2023-06-26 02:52:13,509 INFO [train.py:996] (3/4) Epoch 9, batch 2000, loss[loss=0.2229, simple_loss=0.3092, pruned_loss=0.06828, over 20008.00 frames. ], tot_loss[loss=0.2117, simple_loss=0.2912, pruned_loss=0.06609, over 4269006.82 frames. ], batch size: 702, lr: 3.37e-03, grad_scale: 16.0 2023-06-26 02:53:29,274 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=1475922.0, ans=0.95 2023-06-26 02:53:33,341 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1475922.0, ans=0.125 2023-06-26 02:53:58,858 INFO [train.py:996] (3/4) Epoch 9, batch 2050, loss[loss=0.2264, simple_loss=0.3111, pruned_loss=0.07083, over 21606.00 frames. ], tot_loss[loss=0.2122, simple_loss=0.2924, pruned_loss=0.06604, over 4272123.55 frames. ], batch size: 263, lr: 3.37e-03, grad_scale: 16.0 2023-06-26 02:54:07,970 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=1476042.0, ans=0.2 2023-06-26 02:54:16,487 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.128e+02 5.298e+02 7.792e+02 1.006e+03 2.094e+03, threshold=1.558e+03, percent-clipped=16.0 2023-06-26 02:54:18,808 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1476102.0, ans=0.125 2023-06-26 02:54:26,634 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.03 vs. limit=15.0 2023-06-26 02:55:26,708 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1476222.0, ans=0.125 2023-06-26 02:55:53,044 INFO [train.py:996] (3/4) Epoch 9, batch 2100, loss[loss=0.2304, simple_loss=0.3071, pruned_loss=0.07688, over 21311.00 frames. ], tot_loss[loss=0.2137, simple_loss=0.2927, pruned_loss=0.06732, over 4272633.80 frames. ], batch size: 176, lr: 3.37e-03, grad_scale: 16.0 2023-06-26 02:56:14,846 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=1476402.0, ans=0.0 2023-06-26 02:57:35,209 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1476582.0, ans=0.0 2023-06-26 02:57:44,905 INFO [train.py:996] (3/4) Epoch 9, batch 2150, loss[loss=0.2208, simple_loss=0.3069, pruned_loss=0.06734, over 21226.00 frames. ], tot_loss[loss=0.2165, simple_loss=0.2947, pruned_loss=0.06911, over 4264714.30 frames. 
], batch size: 176, lr: 3.37e-03, grad_scale: 16.0 2023-06-26 02:57:53,242 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1476642.0, ans=0.1 2023-06-26 02:57:54,762 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1476642.0, ans=0.125 2023-06-26 02:58:02,894 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.265e+02 5.087e+02 7.506e+02 1.094e+03 2.833e+03, threshold=1.501e+03, percent-clipped=11.0 2023-06-26 02:58:56,852 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.63 vs. limit=22.5 2023-06-26 02:59:16,231 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=1476882.0, ans=0.07 2023-06-26 02:59:31,672 INFO [train.py:996] (3/4) Epoch 9, batch 2200, loss[loss=0.2111, simple_loss=0.273, pruned_loss=0.07461, over 21683.00 frames. ], tot_loss[loss=0.2166, simple_loss=0.2953, pruned_loss=0.06899, over 4265932.36 frames. ], batch size: 417, lr: 3.37e-03, grad_scale: 16.0 2023-06-26 03:00:27,242 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=1477062.0, ans=0.0 2023-06-26 03:01:15,202 INFO [train.py:996] (3/4) Epoch 9, batch 2250, loss[loss=0.1982, simple_loss=0.2672, pruned_loss=0.06464, over 21737.00 frames. ], tot_loss[loss=0.214, simple_loss=0.2935, pruned_loss=0.0672, over 4264996.92 frames. ], batch size: 351, lr: 3.37e-03, grad_scale: 16.0 2023-06-26 03:01:30,987 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=12.02 vs. limit=15.0 2023-06-26 03:01:32,987 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.139e+02 4.755e+02 7.951e+02 1.208e+03 2.238e+03, threshold=1.590e+03, percent-clipped=7.0 2023-06-26 03:02:25,023 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.25 vs. limit=15.0 2023-06-26 03:03:05,225 INFO [train.py:996] (3/4) Epoch 9, batch 2300, loss[loss=0.2408, simple_loss=0.2831, pruned_loss=0.09925, over 21538.00 frames. ], tot_loss[loss=0.2131, simple_loss=0.2913, pruned_loss=0.06748, over 4271530.36 frames. ], batch size: 512, lr: 3.37e-03, grad_scale: 16.0 2023-06-26 03:04:20,965 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=1477722.0, ans=0.0 2023-06-26 03:04:51,416 INFO [train.py:996] (3/4) Epoch 9, batch 2350, loss[loss=0.228, simple_loss=0.3134, pruned_loss=0.07129, over 19804.00 frames. ], tot_loss[loss=0.2116, simple_loss=0.2888, pruned_loss=0.06722, over 4266963.83 frames. 
], batch size: 702, lr: 3.37e-03, grad_scale: 16.0 2023-06-26 03:04:57,363 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1477842.0, ans=0.1 2023-06-26 03:04:57,391 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1477842.0, ans=0.125 2023-06-26 03:04:59,633 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1477842.0, ans=0.1 2023-06-26 03:05:15,071 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.161e+02 4.711e+02 6.334e+02 1.025e+03 2.139e+03, threshold=1.267e+03, percent-clipped=9.0 2023-06-26 03:06:18,550 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1478022.0, ans=0.125 2023-06-26 03:06:43,943 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1478142.0, ans=0.125 2023-06-26 03:06:44,902 INFO [train.py:996] (3/4) Epoch 9, batch 2400, loss[loss=0.2873, simple_loss=0.3475, pruned_loss=0.1135, over 21437.00 frames. ], tot_loss[loss=0.2157, simple_loss=0.2929, pruned_loss=0.06926, over 4265469.64 frames. ], batch size: 471, lr: 3.37e-03, grad_scale: 32.0 2023-06-26 03:06:47,227 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=1478142.0, ans=0.0 2023-06-26 03:07:45,435 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1478262.0, ans=0.125 2023-06-26 03:08:36,724 INFO [train.py:996] (3/4) Epoch 9, batch 2450, loss[loss=0.2659, simple_loss=0.3411, pruned_loss=0.0953, over 21236.00 frames. ], tot_loss[loss=0.2217, simple_loss=0.2979, pruned_loss=0.07272, over 4263002.46 frames. ], batch size: 159, lr: 3.37e-03, grad_scale: 16.0 2023-06-26 03:09:01,662 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.534e+02 5.033e+02 6.854e+02 1.116e+03 2.187e+03, threshold=1.371e+03, percent-clipped=18.0 2023-06-26 03:09:02,195 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1478502.0, ans=0.125 2023-06-26 03:09:35,409 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=14.16 vs. limit=22.5 2023-06-26 03:09:45,397 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1478622.0, ans=0.125 2023-06-26 03:10:21,186 INFO [train.py:996] (3/4) Epoch 9, batch 2500, loss[loss=0.1894, simple_loss=0.2588, pruned_loss=0.06001, over 21612.00 frames. ], tot_loss[loss=0.2188, simple_loss=0.2942, pruned_loss=0.07165, over 4262664.92 frames. 
], batch size: 263, lr: 3.37e-03, grad_scale: 16.0 2023-06-26 03:10:34,277 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=1478742.0, ans=0.0 2023-06-26 03:11:18,566 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=1478862.0, ans=0.125 2023-06-26 03:12:05,773 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=1479042.0, ans=0.5 2023-06-26 03:12:06,824 INFO [train.py:996] (3/4) Epoch 9, batch 2550, loss[loss=0.1756, simple_loss=0.2911, pruned_loss=0.03002, over 20777.00 frames. ], tot_loss[loss=0.2163, simple_loss=0.2924, pruned_loss=0.07005, over 4270159.56 frames. ], batch size: 608, lr: 3.37e-03, grad_scale: 16.0 2023-06-26 03:12:17,604 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=8.92 vs. limit=15.0 2023-06-26 03:12:37,798 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.177e+02 4.403e+02 6.951e+02 9.882e+02 2.721e+03, threshold=1.390e+03, percent-clipped=12.0 2023-06-26 03:13:57,227 INFO [train.py:996] (3/4) Epoch 9, batch 2600, loss[loss=0.2469, simple_loss=0.3196, pruned_loss=0.08711, over 21862.00 frames. ], tot_loss[loss=0.2187, simple_loss=0.2953, pruned_loss=0.07107, over 4273164.99 frames. ], batch size: 107, lr: 3.37e-03, grad_scale: 16.0 2023-06-26 03:13:57,932 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1479342.0, ans=0.125 2023-06-26 03:14:11,392 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=11.15 vs. limit=15.0 2023-06-26 03:15:03,368 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1479462.0, ans=0.0 2023-06-26 03:15:30,439 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=1479582.0, ans=0.05 2023-06-26 03:15:43,478 INFO [train.py:996] (3/4) Epoch 9, batch 2650, loss[loss=0.2036, simple_loss=0.2973, pruned_loss=0.05497, over 21350.00 frames. ], tot_loss[loss=0.2178, simple_loss=0.2945, pruned_loss=0.07055, over 4274981.06 frames. ], batch size: 211, lr: 3.37e-03, grad_scale: 16.0 2023-06-26 03:16:05,543 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1479702.0, ans=0.2 2023-06-26 03:16:14,201 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.480e+02 5.388e+02 7.988e+02 1.143e+03 2.285e+03, threshold=1.598e+03, percent-clipped=12.0 2023-06-26 03:16:44,374 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1479762.0, ans=0.125 2023-06-26 03:17:29,020 INFO [train.py:996] (3/4) Epoch 9, batch 2700, loss[loss=0.273, simple_loss=0.3421, pruned_loss=0.102, over 21554.00 frames. ], tot_loss[loss=0.2164, simple_loss=0.2927, pruned_loss=0.0701, over 4282730.81 frames. 
], batch size: 509, lr: 3.37e-03, grad_scale: 16.0 2023-06-26 03:17:55,091 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=1480002.0, ans=0.04949747468305833 2023-06-26 03:19:19,992 INFO [train.py:996] (3/4) Epoch 9, batch 2750, loss[loss=0.2108, simple_loss=0.283, pruned_loss=0.06932, over 21938.00 frames. ], tot_loss[loss=0.2164, simple_loss=0.2926, pruned_loss=0.07007, over 4287892.36 frames. ], batch size: 316, lr: 3.37e-03, grad_scale: 16.0 2023-06-26 03:19:22,588 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=1480242.0, ans=0.125 2023-06-26 03:19:51,120 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.311e+02 4.494e+02 5.812e+02 9.696e+02 2.134e+03, threshold=1.162e+03, percent-clipped=3.0 2023-06-26 03:19:54,296 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1480302.0, ans=0.1 2023-06-26 03:20:18,061 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=1480362.0, ans=0.05 2023-06-26 03:20:20,423 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1480362.0, ans=0.1 2023-06-26 03:21:12,968 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=1480482.0, ans=0.125 2023-06-26 03:21:19,556 INFO [train.py:996] (3/4) Epoch 9, batch 2800, loss[loss=0.2386, simple_loss=0.31, pruned_loss=0.08359, over 21623.00 frames. ], tot_loss[loss=0.2195, simple_loss=0.2966, pruned_loss=0.07117, over 4289056.83 frames. ], batch size: 263, lr: 3.37e-03, grad_scale: 32.0 2023-06-26 03:22:05,874 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1480602.0, ans=0.1 2023-06-26 03:22:20,727 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=1480662.0, ans=0.0 2023-06-26 03:22:22,375 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=1480662.0, ans=0.125 2023-06-26 03:23:18,685 INFO [train.py:996] (3/4) Epoch 9, batch 2850, loss[loss=0.1857, simple_loss=0.2602, pruned_loss=0.05559, over 21740.00 frames. ], tot_loss[loss=0.2208, simple_loss=0.2975, pruned_loss=0.07203, over 4289747.44 frames. ], batch size: 282, lr: 3.37e-03, grad_scale: 16.0 2023-06-26 03:23:45,695 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.704e+02 5.417e+02 7.792e+02 1.299e+03 2.553e+03, threshold=1.558e+03, percent-clipped=28.0 2023-06-26 03:23:52,443 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=17.56 vs. limit=22.5 2023-06-26 03:25:03,424 INFO [train.py:996] (3/4) Epoch 9, batch 2900, loss[loss=0.2256, simple_loss=0.3274, pruned_loss=0.06194, over 19772.00 frames. ], tot_loss[loss=0.218, simple_loss=0.2937, pruned_loss=0.07111, over 4289655.27 frames. ], batch size: 702, lr: 3.37e-03, grad_scale: 16.0 2023-06-26 03:25:11,823 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=10.74 vs. 
limit=15.0 2023-06-26 03:25:51,823 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=1481262.0, ans=0.125 2023-06-26 03:26:27,229 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1481382.0, ans=0.125 2023-06-26 03:26:30,887 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=1481382.0, ans=0.0 2023-06-26 03:26:47,821 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1481382.0, ans=0.1 2023-06-26 03:26:53,602 INFO [train.py:996] (3/4) Epoch 9, batch 2950, loss[loss=0.2101, simple_loss=0.3066, pruned_loss=0.05679, over 21819.00 frames. ], tot_loss[loss=0.218, simple_loss=0.2944, pruned_loss=0.07084, over 4288810.25 frames. ], batch size: 282, lr: 3.37e-03, grad_scale: 16.0 2023-06-26 03:26:54,199 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1481442.0, ans=0.125 2023-06-26 03:27:18,330 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=1481502.0, ans=0.125 2023-06-26 03:27:21,364 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.283e+02 4.507e+02 5.801e+02 9.754e+02 1.778e+03, threshold=1.160e+03, percent-clipped=2.0 2023-06-26 03:27:41,543 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=1481562.0, ans=10.0 2023-06-26 03:28:38,638 INFO [train.py:996] (3/4) Epoch 9, batch 3000, loss[loss=0.2295, simple_loss=0.3097, pruned_loss=0.07471, over 21751.00 frames. ], tot_loss[loss=0.2221, simple_loss=0.2996, pruned_loss=0.07223, over 4291906.92 frames. ], batch size: 298, lr: 3.37e-03, grad_scale: 16.0 2023-06-26 03:28:38,639 INFO [train.py:1019] (3/4) Computing validation loss 2023-06-26 03:28:52,910 INFO [zipformer.py:1728] (3/4) name=encoder.encoders.3.encoder.layers.0.self_attn_weights, attn_weights_entropy = tensor([3.1379, 2.6317, 2.6284, 3.1990, 1.7493, 2.9298, 2.8617, 2.1340], device='cuda:3') 2023-06-26 03:28:52,921 INFO [zipformer.py:1728] (3/4) name=encoder.encoders.4.encoder.layers.0.self_attn_weights, attn_weights_entropy = tensor([2.1058, 3.5590, 3.1150, 2.1245], device='cuda:3') 2023-06-26 03:29:01,197 INFO [train.py:1028] (3/4) Epoch 9, validation: loss=0.2514, simple_loss=0.3427, pruned_loss=0.08003, over 1796401.00 frames. 2023-06-26 03:29:01,198 INFO [train.py:1029] (3/4) Maximum memory allocated so far is 23690MB 2023-06-26 03:29:04,399 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.97 vs. limit=6.0 2023-06-26 03:29:06,287 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1481742.0, ans=0.0 2023-06-26 03:29:07,831 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=1481742.0, ans=0.125 2023-06-26 03:29:45,077 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=1481862.0, ans=0.0 2023-06-26 03:30:38,904 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.41 vs. 
limit=22.5 2023-06-26 03:30:48,463 INFO [train.py:996] (3/4) Epoch 9, batch 3050, loss[loss=0.2109, simple_loss=0.3079, pruned_loss=0.057, over 21178.00 frames. ], tot_loss[loss=0.2201, simple_loss=0.2994, pruned_loss=0.07044, over 4291975.39 frames. ], batch size: 548, lr: 3.37e-03, grad_scale: 16.0 2023-06-26 03:31:06,779 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=1482102.0, ans=0.05 2023-06-26 03:31:09,634 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.261e+02 4.684e+02 7.478e+02 1.068e+03 1.857e+03, threshold=1.496e+03, percent-clipped=20.0 2023-06-26 03:31:25,217 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1482162.0, ans=0.1 2023-06-26 03:31:58,259 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1482222.0, ans=0.0 2023-06-26 03:31:58,904 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=7.76 vs. limit=15.0 2023-06-26 03:32:42,241 INFO [train.py:996] (3/4) Epoch 9, batch 3100, loss[loss=0.208, simple_loss=0.3064, pruned_loss=0.05482, over 21697.00 frames. ], tot_loss[loss=0.2207, simple_loss=0.3008, pruned_loss=0.07034, over 4290788.86 frames. ], batch size: 263, lr: 3.37e-03, grad_scale: 16.0 2023-06-26 03:32:57,892 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1482342.0, ans=0.125 2023-06-26 03:32:59,799 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1482402.0, ans=0.1 2023-06-26 03:33:00,506 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=8.68 vs. limit=15.0 2023-06-26 03:34:05,285 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=1482522.0, ans=0.0 2023-06-26 03:34:36,171 INFO [train.py:996] (3/4) Epoch 9, batch 3150, loss[loss=0.3028, simple_loss=0.3609, pruned_loss=0.1224, over 21422.00 frames. ], tot_loss[loss=0.2216, simple_loss=0.3021, pruned_loss=0.07059, over 4292777.60 frames. ], batch size: 471, lr: 3.36e-03, grad_scale: 16.0 2023-06-26 03:34:57,438 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=6.87 vs. limit=15.0 2023-06-26 03:34:58,316 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.871e+02 4.385e+02 6.208e+02 9.255e+02 2.149e+03, threshold=1.242e+03, percent-clipped=3.0 2023-06-26 03:36:07,314 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1482822.0, ans=0.1 2023-06-26 03:36:28,381 INFO [train.py:996] (3/4) Epoch 9, batch 3200, loss[loss=0.23, simple_loss=0.3067, pruned_loss=0.07665, over 21240.00 frames. ], tot_loss[loss=0.2234, simple_loss=0.3044, pruned_loss=0.07125, over 4295050.12 frames. 
], batch size: 143, lr: 3.36e-03, grad_scale: 32.0 2023-06-26 03:36:56,848 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1483002.0, ans=0.125 2023-06-26 03:37:40,464 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=1483122.0, ans=0.0 2023-06-26 03:37:56,669 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=1483182.0, ans=0.0 2023-06-26 03:38:13,652 INFO [train.py:996] (3/4) Epoch 9, batch 3250, loss[loss=0.2156, simple_loss=0.2787, pruned_loss=0.07628, over 21598.00 frames. ], tot_loss[loss=0.2262, simple_loss=0.3053, pruned_loss=0.07354, over 4297071.91 frames. ], batch size: 415, lr: 3.36e-03, grad_scale: 16.0 2023-06-26 03:38:37,653 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1483242.0, ans=0.0 2023-06-26 03:38:47,154 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.805e+02 4.877e+02 6.649e+02 1.271e+03 2.472e+03, threshold=1.330e+03, percent-clipped=27.0 2023-06-26 03:38:59,409 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1483302.0, ans=0.125 2023-06-26 03:39:08,463 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.39 vs. limit=6.0 2023-06-26 03:39:31,341 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=1483422.0, ans=0.05 2023-06-26 03:39:58,091 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=2.74 vs. limit=15.0 2023-06-26 03:40:00,834 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1483482.0, ans=0.125 2023-06-26 03:40:00,959 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1483482.0, ans=0.0 2023-06-26 03:40:05,653 INFO [train.py:996] (3/4) Epoch 9, batch 3300, loss[loss=0.2269, simple_loss=0.3256, pruned_loss=0.06406, over 21594.00 frames. ], tot_loss[loss=0.2229, simple_loss=0.3005, pruned_loss=0.07264, over 4295266.80 frames. ], batch size: 414, lr: 3.36e-03, grad_scale: 16.0 2023-06-26 03:40:33,585 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=1483602.0, ans=0.125 2023-06-26 03:40:48,226 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1483602.0, ans=0.125 2023-06-26 03:42:03,513 INFO [train.py:996] (3/4) Epoch 9, batch 3350, loss[loss=0.2317, simple_loss=0.3077, pruned_loss=0.07782, over 21783.00 frames. ], tot_loss[loss=0.2251, simple_loss=0.3031, pruned_loss=0.07357, over 4290647.24 frames. 
], batch size: 247, lr: 3.36e-03, grad_scale: 16.0 2023-06-26 03:42:18,470 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1483842.0, ans=0.125 2023-06-26 03:42:36,950 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.432e+02 5.027e+02 7.904e+02 1.051e+03 2.659e+03, threshold=1.581e+03, percent-clipped=15.0 2023-06-26 03:42:46,473 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-26 03:42:46,479 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1483902.0, ans=0.125 2023-06-26 03:43:30,745 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=1484082.0, ans=0.0 2023-06-26 03:43:35,024 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.28 vs. limit=12.0 2023-06-26 03:43:39,544 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=1484082.0, ans=0.0 2023-06-26 03:43:46,832 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.66 vs. limit=22.5 2023-06-26 03:43:56,169 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=6.62 vs. limit=12.0 2023-06-26 03:43:58,357 INFO [train.py:996] (3/4) Epoch 9, batch 3400, loss[loss=0.2166, simple_loss=0.3149, pruned_loss=0.05917, over 21808.00 frames. ], tot_loss[loss=0.2257, simple_loss=0.3032, pruned_loss=0.07405, over 4288512.97 frames. ], batch size: 351, lr: 3.36e-03, grad_scale: 16.0 2023-06-26 03:44:00,490 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=1484142.0, ans=0.125 2023-06-26 03:44:56,129 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1484262.0, ans=0.125 2023-06-26 03:44:56,752 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.18 vs. limit=15.0 2023-06-26 03:45:51,261 INFO [train.py:996] (3/4) Epoch 9, batch 3450, loss[loss=0.2218, simple_loss=0.2892, pruned_loss=0.07717, over 21361.00 frames. ], tot_loss[loss=0.2221, simple_loss=0.2979, pruned_loss=0.07315, over 4286832.97 frames. 
], batch size: 211, lr: 3.36e-03, grad_scale: 16.0 2023-06-26 03:46:05,910 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1484442.0, ans=0.125 2023-06-26 03:46:19,613 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.178e+02 5.080e+02 7.210e+02 9.972e+02 1.993e+03, threshold=1.442e+03, percent-clipped=4.0 2023-06-26 03:46:29,351 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1484502.0, ans=0.0 2023-06-26 03:46:29,373 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1484502.0, ans=0.125 2023-06-26 03:46:46,468 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=1484562.0, ans=0.2 2023-06-26 03:46:46,597 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=1484562.0, ans=0.2 2023-06-26 03:47:32,743 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=1484682.0, ans=0.125 2023-06-26 03:47:47,814 INFO [train.py:996] (3/4) Epoch 9, batch 3500, loss[loss=0.2412, simple_loss=0.333, pruned_loss=0.07468, over 21577.00 frames. ], tot_loss[loss=0.2278, simple_loss=0.3048, pruned_loss=0.07538, over 4288575.65 frames. ], batch size: 230, lr: 3.36e-03, grad_scale: 16.0 2023-06-26 03:47:53,904 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1484742.0, ans=0.125 2023-06-26 03:48:48,206 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1484922.0, ans=0.1 2023-06-26 03:48:58,731 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.77 vs. limit=22.5 2023-06-26 03:49:37,586 INFO [train.py:996] (3/4) Epoch 9, batch 3550, loss[loss=0.2101, simple_loss=0.2762, pruned_loss=0.07203, over 21753.00 frames. ], tot_loss[loss=0.2308, simple_loss=0.3073, pruned_loss=0.07714, over 4291573.04 frames. ], batch size: 102, lr: 3.36e-03, grad_scale: 16.0 2023-06-26 03:49:55,959 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1485042.0, ans=0.0 2023-06-26 03:50:05,947 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.388e+02 4.836e+02 6.336e+02 9.493e+02 2.947e+03, threshold=1.267e+03, percent-clipped=8.0 2023-06-26 03:50:58,151 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1485222.0, ans=0.125 2023-06-26 03:51:07,165 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.13 vs. limit=22.5 2023-06-26 03:51:26,692 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1485342.0, ans=0.125 2023-06-26 03:51:27,686 INFO [train.py:996] (3/4) Epoch 9, batch 3600, loss[loss=0.1738, simple_loss=0.2334, pruned_loss=0.05709, over 21511.00 frames. ], tot_loss[loss=0.2255, simple_loss=0.3001, pruned_loss=0.07547, over 4291765.88 frames. 
], batch size: 213, lr: 3.36e-03, grad_scale: 32.0 2023-06-26 03:51:52,269 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1485402.0, ans=0.125 2023-06-26 03:52:13,348 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1485462.0, ans=0.0 2023-06-26 03:52:49,283 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-26 03:53:28,613 INFO [train.py:996] (3/4) Epoch 9, batch 3650, loss[loss=0.2035, simple_loss=0.2556, pruned_loss=0.07571, over 20264.00 frames. ], tot_loss[loss=0.227, simple_loss=0.3023, pruned_loss=0.07586, over 4285236.62 frames. ], batch size: 703, lr: 3.36e-03, grad_scale: 16.0 2023-06-26 03:53:30,008 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=7.54 vs. limit=15.0 2023-06-26 03:53:38,621 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=1485642.0, ans=0.125 2023-06-26 03:53:45,491 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1485702.0, ans=0.125 2023-06-26 03:53:53,235 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.365e+02 4.857e+02 6.488e+02 1.037e+03 3.171e+03, threshold=1.298e+03, percent-clipped=18.0 2023-06-26 03:53:59,450 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1485702.0, ans=0.1 2023-06-26 03:54:07,981 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=1485762.0, ans=0.035 2023-06-26 03:54:24,378 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=1485762.0, ans=0.125 2023-06-26 03:54:54,849 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=1485882.0, ans=0.125 2023-06-26 03:54:57,204 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1485882.0, ans=0.1 2023-06-26 03:55:19,268 INFO [train.py:996] (3/4) Epoch 9, batch 3700, loss[loss=0.2133, simple_loss=0.2848, pruned_loss=0.07089, over 21797.00 frames. ], tot_loss[loss=0.2278, simple_loss=0.3036, pruned_loss=0.07603, over 4285251.55 frames. ], batch size: 247, lr: 3.36e-03, grad_scale: 16.0 2023-06-26 03:56:09,989 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.88 vs. limit=12.0 2023-06-26 03:57:10,164 INFO [train.py:996] (3/4) Epoch 9, batch 3750, loss[loss=0.1905, simple_loss=0.2668, pruned_loss=0.05708, over 21778.00 frames. ], tot_loss[loss=0.2262, simple_loss=0.3014, pruned_loss=0.07544, over 4289026.89 frames. 
], batch size: 282, lr: 3.36e-03, grad_scale: 16.0 2023-06-26 03:57:28,835 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=1486302.0, ans=0.2 2023-06-26 03:57:35,343 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.387e+02 4.638e+02 6.369e+02 1.007e+03 1.951e+03, threshold=1.274e+03, percent-clipped=10.0 2023-06-26 03:57:48,183 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=1486302.0, ans=0.95 2023-06-26 03:57:51,894 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1486362.0, ans=0.125 2023-06-26 03:58:06,127 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=1486362.0, ans=0.0 2023-06-26 03:58:46,174 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1.whitening_limit, batch_count=1486482.0, ans=10.0 2023-06-26 03:59:00,831 INFO [train.py:996] (3/4) Epoch 9, batch 3800, loss[loss=0.2073, simple_loss=0.289, pruned_loss=0.06284, over 21696.00 frames. ], tot_loss[loss=0.2219, simple_loss=0.2978, pruned_loss=0.07296, over 4289048.97 frames. ], batch size: 351, lr: 3.36e-03, grad_scale: 16.0 2023-06-26 03:59:17,304 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1486602.0, ans=0.0 2023-06-26 04:00:28,800 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1486722.0, ans=0.0 2023-06-26 04:00:49,624 INFO [train.py:996] (3/4) Epoch 9, batch 3850, loss[loss=0.1968, simple_loss=0.2535, pruned_loss=0.07002, over 20201.00 frames. ], tot_loss[loss=0.2219, simple_loss=0.2964, pruned_loss=0.07375, over 4274641.52 frames. ], batch size: 703, lr: 3.36e-03, grad_scale: 16.0 2023-06-26 04:01:19,294 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.209e+02 4.310e+02 5.472e+02 7.871e+02 1.774e+03, threshold=1.094e+03, percent-clipped=3.0 2023-06-26 04:01:28,327 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=1486962.0, ans=0.0 2023-06-26 04:02:22,317 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1487082.0, ans=0.1 2023-06-26 04:02:36,364 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=1487082.0, ans=0.2 2023-06-26 04:02:39,289 INFO [train.py:996] (3/4) Epoch 9, batch 3900, loss[loss=0.2451, simple_loss=0.3205, pruned_loss=0.08481, over 21363.00 frames. ], tot_loss[loss=0.2185, simple_loss=0.2918, pruned_loss=0.07263, over 4272214.13 frames. 
], batch size: 548, lr: 3.36e-03, grad_scale: 16.0 2023-06-26 04:02:45,219 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=1487142.0, ans=0.2 2023-06-26 04:02:59,719 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=1487202.0, ans=0.04949747468305833 2023-06-26 04:04:09,495 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=1487322.0, ans=0.125 2023-06-26 04:04:11,010 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1487382.0, ans=0.125 2023-06-26 04:04:24,157 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=5.61 vs. limit=22.5 2023-06-26 04:04:29,553 INFO [train.py:996] (3/4) Epoch 9, batch 3950, loss[loss=0.2042, simple_loss=0.2937, pruned_loss=0.0574, over 21636.00 frames. ], tot_loss[loss=0.2178, simple_loss=0.2933, pruned_loss=0.07118, over 4280286.09 frames. ], batch size: 441, lr: 3.36e-03, grad_scale: 16.0 2023-06-26 04:04:59,604 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.430e+02 5.269e+02 7.379e+02 1.187e+03 2.051e+03, threshold=1.476e+03, percent-clipped=29.0 2023-06-26 04:05:57,753 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=1487622.0, ans=0.0 2023-06-26 04:06:10,558 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=9.04 vs. limit=22.5 2023-06-26 04:06:14,978 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1487682.0, ans=0.125 2023-06-26 04:06:21,601 INFO [train.py:996] (3/4) Epoch 9, batch 4000, loss[loss=0.1939, simple_loss=0.2592, pruned_loss=0.06423, over 21555.00 frames. ], tot_loss[loss=0.2115, simple_loss=0.287, pruned_loss=0.06806, over 4273961.17 frames. ], batch size: 391, lr: 3.36e-03, grad_scale: 32.0 2023-06-26 04:06:41,944 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1487742.0, ans=0.1 2023-06-26 04:07:04,900 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-26 04:07:19,315 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=1487862.0, ans=0.0 2023-06-26 04:07:51,497 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.33 vs. limit=6.0 2023-06-26 04:08:15,054 INFO [train.py:996] (3/4) Epoch 9, batch 4050, loss[loss=0.209, simple_loss=0.2976, pruned_loss=0.06021, over 21446.00 frames. ], tot_loss[loss=0.2095, simple_loss=0.2862, pruned_loss=0.06637, over 4267429.82 frames. 
], batch size: 194, lr: 3.36e-03, grad_scale: 16.0 2023-06-26 04:08:54,031 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.060e+02 4.408e+02 5.792e+02 1.027e+03 1.957e+03, threshold=1.158e+03, percent-clipped=6.0 2023-06-26 04:08:54,407 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1488102.0, ans=0.125 2023-06-26 04:09:40,390 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=1488222.0, ans=0.125 2023-06-26 04:10:05,263 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=1488342.0, ans=0.125 2023-06-26 04:10:06,485 INFO [train.py:996] (3/4) Epoch 9, batch 4100, loss[loss=0.2057, simple_loss=0.288, pruned_loss=0.06171, over 21801.00 frames. ], tot_loss[loss=0.2118, simple_loss=0.2889, pruned_loss=0.06733, over 4279505.55 frames. ], batch size: 332, lr: 3.36e-03, grad_scale: 16.0 2023-06-26 04:10:27,683 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1488342.0, ans=0.125 2023-06-26 04:10:27,739 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=1488342.0, ans=0.0 2023-06-26 04:11:18,686 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1488462.0, ans=0.125 2023-06-26 04:11:56,093 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1488582.0, ans=0.1 2023-06-26 04:11:58,953 INFO [train.py:996] (3/4) Epoch 9, batch 4150, loss[loss=0.1853, simple_loss=0.2494, pruned_loss=0.06063, over 21299.00 frames. ], tot_loss[loss=0.2103, simple_loss=0.2902, pruned_loss=0.0652, over 4279739.20 frames. ], batch size: 551, lr: 3.36e-03, grad_scale: 16.0 2023-06-26 04:12:08,818 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1488642.0, ans=0.125 2023-06-26 04:12:42,738 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.917e+02 4.750e+02 6.636e+02 9.716e+02 1.939e+03, threshold=1.327e+03, percent-clipped=13.0 2023-06-26 04:13:18,016 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1488822.0, ans=0.125 2023-06-26 04:13:24,194 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=7.31 vs. limit=15.0 2023-06-26 04:13:28,151 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=7.89 vs. limit=15.0 2023-06-26 04:13:36,784 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.38 vs. limit=22.5 2023-06-26 04:13:57,434 INFO [train.py:996] (3/4) Epoch 9, batch 4200, loss[loss=0.1966, simple_loss=0.2619, pruned_loss=0.06568, over 21699.00 frames. ], tot_loss[loss=0.2088, simple_loss=0.2895, pruned_loss=0.06408, over 4269104.10 frames. ], batch size: 112, lr: 3.36e-03, grad_scale: 16.0 2023-06-26 04:14:33,422 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.82 vs. 
limit=15.0 2023-06-26 04:14:34,788 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1489002.0, ans=0.0 2023-06-26 04:14:39,914 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1489002.0, ans=0.1 2023-06-26 04:15:40,624 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1489182.0, ans=0.125 2023-06-26 04:15:56,535 INFO [train.py:996] (3/4) Epoch 9, batch 4250, loss[loss=0.235, simple_loss=0.312, pruned_loss=0.07894, over 21593.00 frames. ], tot_loss[loss=0.215, simple_loss=0.2966, pruned_loss=0.06664, over 4268869.78 frames. ], batch size: 230, lr: 3.36e-03, grad_scale: 16.0 2023-06-26 04:16:27,996 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=1489302.0, ans=0.125 2023-06-26 04:16:34,520 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.173e+02 6.968e+02 9.905e+02 1.425e+03 3.258e+03, threshold=1.981e+03, percent-clipped=30.0 2023-06-26 04:16:35,212 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1489302.0, ans=0.1 2023-06-26 04:17:55,659 INFO [train.py:996] (3/4) Epoch 9, batch 4300, loss[loss=0.2266, simple_loss=0.3349, pruned_loss=0.05915, over 21629.00 frames. ], tot_loss[loss=0.2199, simple_loss=0.3027, pruned_loss=0.06855, over 4274909.97 frames. ], batch size: 389, lr: 3.36e-03, grad_scale: 16.0 2023-06-26 04:19:09,168 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1489722.0, ans=0.125 2023-06-26 04:19:13,953 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=1489722.0, ans=0.0 2023-06-26 04:19:18,162 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.89 vs. limit=15.0 2023-06-26 04:19:27,317 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=14.16 vs. limit=22.5 2023-06-26 04:19:52,241 INFO [train.py:996] (3/4) Epoch 9, batch 4350, loss[loss=0.208, simple_loss=0.2772, pruned_loss=0.06939, over 21850.00 frames. ], tot_loss[loss=0.2198, simple_loss=0.302, pruned_loss=0.06881, over 4262297.83 frames. ], batch size: 98, lr: 3.36e-03, grad_scale: 16.0 2023-06-26 04:20:15,069 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.46 vs. 
limit=22.5 2023-06-26 04:20:17,831 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-26 04:20:18,895 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.396e+02 4.613e+02 6.929e+02 1.161e+03 2.829e+03, threshold=1.386e+03, percent-clipped=7.0 2023-06-26 04:20:24,854 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=1489902.0, ans=0.2 2023-06-26 04:21:17,409 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1490082.0, ans=0.125 2023-06-26 04:21:26,554 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1490082.0, ans=0.125 2023-06-26 04:21:42,261 INFO [train.py:996] (3/4) Epoch 9, batch 4400, loss[loss=0.2057, simple_loss=0.2969, pruned_loss=0.05727, over 21652.00 frames. ], tot_loss[loss=0.2169, simple_loss=0.2983, pruned_loss=0.06777, over 4269955.66 frames. ], batch size: 247, lr: 3.36e-03, grad_scale: 32.0 2023-06-26 04:21:44,584 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1490142.0, ans=0.1 2023-06-26 04:22:04,536 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1490202.0, ans=0.1 2023-06-26 04:22:25,875 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten.whitening_limit, batch_count=1490262.0, ans=15.0 2023-06-26 04:23:03,786 INFO [scaling.py:962] (3/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=5.76 vs. limit=8.0 2023-06-26 04:23:35,718 INFO [train.py:996] (3/4) Epoch 9, batch 4450, loss[loss=0.3262, simple_loss=0.4431, pruned_loss=0.1046, over 19711.00 frames. ], tot_loss[loss=0.2227, simple_loss=0.3063, pruned_loss=0.06957, over 4268568.68 frames. ], batch size: 702, lr: 3.36e-03, grad_scale: 16.0 2023-06-26 04:23:36,677 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1490442.0, ans=0.125 2023-06-26 04:24:02,268 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1490502.0, ans=0.0 2023-06-26 04:24:03,477 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.467e+02 5.132e+02 7.510e+02 1.153e+03 2.650e+03, threshold=1.502e+03, percent-clipped=12.0 2023-06-26 04:24:54,459 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1490622.0, ans=0.0 2023-06-26 04:25:19,889 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.95 vs. limit=12.0 2023-06-26 04:25:25,700 INFO [train.py:996] (3/4) Epoch 9, batch 4500, loss[loss=0.2265, simple_loss=0.2943, pruned_loss=0.07935, over 20805.00 frames. ], tot_loss[loss=0.2257, simple_loss=0.3078, pruned_loss=0.07181, over 4275231.12 frames. 
], batch size: 611, lr: 3.36e-03, grad_scale: 16.0 2023-06-26 04:26:11,567 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1490862.0, ans=0.125 2023-06-26 04:26:18,357 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=1490862.0, ans=0.07 2023-06-26 04:26:48,966 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1490922.0, ans=0.1 2023-06-26 04:26:57,041 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=9.19 vs. limit=15.0 2023-06-26 04:26:59,800 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1490982.0, ans=0.125 2023-06-26 04:27:15,814 INFO [train.py:996] (3/4) Epoch 9, batch 4550, loss[loss=0.2449, simple_loss=0.3183, pruned_loss=0.08574, over 21359.00 frames. ], tot_loss[loss=0.2267, simple_loss=0.3094, pruned_loss=0.07201, over 4271956.08 frames. ], batch size: 548, lr: 3.36e-03, grad_scale: 16.0 2023-06-26 04:28:00,501 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.475e+02 4.870e+02 6.557e+02 1.171e+03 3.635e+03, threshold=1.311e+03, percent-clipped=15.0 2023-06-26 04:28:04,963 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1491102.0, ans=0.0 2023-06-26 04:28:17,908 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.65 vs. limit=15.0 2023-06-26 04:28:28,426 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1491162.0, ans=0.1 2023-06-26 04:28:37,096 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.99 vs. limit=12.0 2023-06-26 04:28:48,931 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=1491282.0, ans=0.0 2023-06-26 04:29:00,811 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1491282.0, ans=0.125 2023-06-26 04:29:05,593 INFO [train.py:996] (3/4) Epoch 9, batch 4600, loss[loss=0.2127, simple_loss=0.2917, pruned_loss=0.06689, over 21431.00 frames. ], tot_loss[loss=0.2276, simple_loss=0.3092, pruned_loss=0.07299, over 4276799.57 frames. ], batch size: 211, lr: 3.35e-03, grad_scale: 16.0 2023-06-26 04:29:43,783 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.53 vs. 
limit=15.0 2023-06-26 04:29:46,581 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1491402.0, ans=0.125 2023-06-26 04:29:51,829 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1491402.0, ans=0.125 2023-06-26 04:29:57,330 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-26 04:30:51,564 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1491582.0, ans=0.0 2023-06-26 04:31:00,435 INFO [train.py:996] (3/4) Epoch 9, batch 4650, loss[loss=0.1653, simple_loss=0.2407, pruned_loss=0.04494, over 21776.00 frames. ], tot_loss[loss=0.222, simple_loss=0.3025, pruned_loss=0.07072, over 4285528.83 frames. ], batch size: 282, lr: 3.35e-03, grad_scale: 16.0 2023-06-26 04:31:34,329 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1491702.0, ans=0.125 2023-06-26 04:31:37,532 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.max_abs, batch_count=1491702.0, ans=10.0 2023-06-26 04:31:38,988 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.118e+02 4.318e+02 5.535e+02 7.322e+02 1.899e+03, threshold=1.107e+03, percent-clipped=2.0 2023-06-26 04:31:40,588 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=15.14 vs. limit=22.5 2023-06-26 04:32:23,289 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1491822.0, ans=0.1 2023-06-26 04:32:45,383 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.71 vs. limit=12.0 2023-06-26 04:32:55,320 INFO [train.py:996] (3/4) Epoch 9, batch 4700, loss[loss=0.2138, simple_loss=0.2693, pruned_loss=0.07912, over 21456.00 frames. ], tot_loss[loss=0.215, simple_loss=0.2928, pruned_loss=0.06864, over 4276040.07 frames. ], batch size: 473, lr: 3.35e-03, grad_scale: 16.0 2023-06-26 04:33:03,161 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=8.34 vs. limit=15.0 2023-06-26 04:33:11,934 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=1491942.0, ans=0.0 2023-06-26 04:33:32,194 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=1492002.0, ans=0.125 2023-06-26 04:33:45,684 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=1492062.0, ans=0.2 2023-06-26 04:34:19,592 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=7.25 vs. limit=15.0 2023-06-26 04:34:38,023 INFO [train.py:996] (3/4) Epoch 9, batch 4750, loss[loss=0.1866, simple_loss=0.2512, pruned_loss=0.06101, over 21739.00 frames. ], tot_loss[loss=0.2133, simple_loss=0.2881, pruned_loss=0.06925, over 4282682.54 frames. 
], batch size: 283, lr: 3.35e-03, grad_scale: 16.0 2023-06-26 04:35:16,912 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.998e+02 4.450e+02 6.729e+02 1.004e+03 1.717e+03, threshold=1.346e+03, percent-clipped=12.0 2023-06-26 04:36:04,417 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.55 vs. limit=6.0 2023-06-26 04:36:16,969 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.22 vs. limit=22.5 2023-06-26 04:36:32,937 INFO [train.py:996] (3/4) Epoch 9, batch 4800, loss[loss=0.2368, simple_loss=0.3185, pruned_loss=0.07752, over 21815.00 frames. ], tot_loss[loss=0.2141, simple_loss=0.289, pruned_loss=0.06966, over 4286342.30 frames. ], batch size: 414, lr: 3.35e-03, grad_scale: 32.0 2023-06-26 04:36:59,887 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.22 vs. limit=15.0 2023-06-26 04:37:06,637 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.87 vs. limit=15.0 2023-06-26 04:37:17,207 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1492662.0, ans=0.1 2023-06-26 04:37:29,441 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1492662.0, ans=0.1 2023-06-26 04:37:30,828 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=1492662.0, ans=0.2 2023-06-26 04:37:32,476 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=1492662.0, ans=0.0 2023-06-26 04:37:37,793 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1492722.0, ans=0.1 2023-06-26 04:38:11,578 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=1492782.0, ans=0.125 2023-06-26 04:38:21,209 INFO [train.py:996] (3/4) Epoch 9, batch 4850, loss[loss=0.2056, simple_loss=0.2841, pruned_loss=0.06359, over 21840.00 frames. ], tot_loss[loss=0.2136, simple_loss=0.2889, pruned_loss=0.06919, over 4293155.86 frames. ], batch size: 298, lr: 3.35e-03, grad_scale: 16.0 2023-06-26 04:38:28,733 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1492842.0, ans=0.125 2023-06-26 04:38:53,205 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=1492902.0, ans=0.2 2023-06-26 04:38:56,369 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.280e+02 4.162e+02 5.020e+02 8.423e+02 2.243e+03, threshold=1.004e+03, percent-clipped=7.0 2023-06-26 04:39:06,398 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-26 04:40:11,790 INFO [train.py:996] (3/4) Epoch 9, batch 4900, loss[loss=0.2446, simple_loss=0.3377, pruned_loss=0.07574, over 21663.00 frames. ], tot_loss[loss=0.2166, simple_loss=0.2922, pruned_loss=0.07053, over 4300165.98 frames. 
], batch size: 389, lr: 3.35e-03, grad_scale: 16.0 2023-06-26 04:40:43,244 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.92 vs. limit=6.0 2023-06-26 04:42:01,683 INFO [train.py:996] (3/4) Epoch 9, batch 4950, loss[loss=0.1886, simple_loss=0.2887, pruned_loss=0.04427, over 21605.00 frames. ], tot_loss[loss=0.2174, simple_loss=0.2961, pruned_loss=0.06938, over 4284699.33 frames. ], batch size: 230, lr: 3.35e-03, grad_scale: 16.0 2023-06-26 04:42:21,569 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=1493442.0, ans=0.0 2023-06-26 04:42:33,755 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=1493502.0, ans=0.5 2023-06-26 04:42:42,359 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.982e+02 5.004e+02 7.690e+02 1.209e+03 2.410e+03, threshold=1.538e+03, percent-clipped=31.0 2023-06-26 04:43:49,283 INFO [train.py:996] (3/4) Epoch 9, batch 5000, loss[loss=0.2079, simple_loss=0.2862, pruned_loss=0.06474, over 21808.00 frames. ], tot_loss[loss=0.2143, simple_loss=0.2956, pruned_loss=0.06653, over 4287557.74 frames. ], batch size: 298, lr: 3.35e-03, grad_scale: 16.0 2023-06-26 04:43:57,352 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.71 vs. limit=15.0 2023-06-26 04:44:53,846 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=1493922.0, ans=0.2 2023-06-26 04:44:56,106 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=13.53 vs. limit=22.5 2023-06-26 04:45:00,910 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=1493922.0, ans=0.125 2023-06-26 04:45:06,410 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=12.11 vs. limit=15.0 2023-06-26 04:45:37,553 INFO [train.py:996] (3/4) Epoch 9, batch 5050, loss[loss=0.2492, simple_loss=0.306, pruned_loss=0.09622, over 21780.00 frames. ], tot_loss[loss=0.2157, simple_loss=0.2951, pruned_loss=0.06816, over 4296513.68 frames. ], batch size: 508, lr: 3.35e-03, grad_scale: 16.0 2023-06-26 04:45:41,357 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=1494042.0, ans=0.0 2023-06-26 04:45:48,459 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1494042.0, ans=0.125 2023-06-26 04:46:12,688 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.467e+02 4.718e+02 6.361e+02 8.600e+02 1.640e+03, threshold=1.272e+03, percent-clipped=2.0 2023-06-26 04:46:40,539 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=11.39 vs. limit=15.0 2023-06-26 04:47:26,058 INFO [train.py:996] (3/4) Epoch 9, batch 5100, loss[loss=0.1757, simple_loss=0.2568, pruned_loss=0.04727, over 21324.00 frames. ], tot_loss[loss=0.2139, simple_loss=0.2923, pruned_loss=0.06772, over 4298763.94 frames. 
], batch size: 176, lr: 3.35e-03, grad_scale: 16.0 2023-06-26 04:48:18,817 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.08 vs. limit=22.5 2023-06-26 04:48:19,744 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1494462.0, ans=0.125 2023-06-26 04:49:09,870 INFO [train.py:996] (3/4) Epoch 9, batch 5150, loss[loss=0.2338, simple_loss=0.315, pruned_loss=0.07632, over 21705.00 frames. ], tot_loss[loss=0.2144, simple_loss=0.2914, pruned_loss=0.06872, over 4299430.07 frames. ], batch size: 441, lr: 3.35e-03, grad_scale: 16.0 2023-06-26 04:49:19,236 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1494642.0, ans=0.125 2023-06-26 04:49:20,799 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=1494642.0, ans=0.0 2023-06-26 04:49:37,225 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.06 vs. limit=22.5 2023-06-26 04:49:50,340 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.908e+02 4.572e+02 6.344e+02 1.136e+03 2.635e+03, threshold=1.269e+03, percent-clipped=18.0 2023-06-26 04:50:29,534 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=1494822.0, ans=0.2 2023-06-26 04:51:10,736 INFO [train.py:996] (3/4) Epoch 9, batch 5200, loss[loss=0.2512, simple_loss=0.3591, pruned_loss=0.07168, over 21863.00 frames. ], tot_loss[loss=0.2171, simple_loss=0.2952, pruned_loss=0.06949, over 4294507.90 frames. ], batch size: 371, lr: 3.35e-03, grad_scale: 32.0 2023-06-26 04:51:49,225 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=1495002.0, ans=0.2 2023-06-26 04:52:04,215 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=1495062.0, ans=0.125 2023-06-26 04:52:16,311 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-26 04:52:18,056 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1495122.0, ans=0.125 2023-06-26 04:52:28,543 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1495122.0, ans=0.125 2023-06-26 04:52:58,597 INFO [train.py:996] (3/4) Epoch 9, batch 5250, loss[loss=0.2145, simple_loss=0.2891, pruned_loss=0.06993, over 21335.00 frames. ], tot_loss[loss=0.2187, simple_loss=0.3007, pruned_loss=0.06837, over 4290684.17 frames. ], batch size: 131, lr: 3.35e-03, grad_scale: 16.0 2023-06-26 04:53:35,848 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.142e+02 4.723e+02 6.704e+02 8.682e+02 1.617e+03, threshold=1.341e+03, percent-clipped=7.0 2023-06-26 04:54:50,683 INFO [train.py:996] (3/4) Epoch 9, batch 5300, loss[loss=0.2255, simple_loss=0.2854, pruned_loss=0.08281, over 21829.00 frames. ], tot_loss[loss=0.2186, simple_loss=0.2996, pruned_loss=0.0688, over 4294784.09 frames. 
], batch size: 441, lr: 3.35e-03, grad_scale: 16.0 2023-06-26 04:55:03,732 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1495542.0, ans=0.0 2023-06-26 04:55:12,848 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=1495602.0, ans=0.0 2023-06-26 04:55:19,621 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=1495602.0, ans=0.125 2023-06-26 04:55:56,905 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=1495722.0, ans=0.2 2023-06-26 04:55:58,433 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=1495722.0, ans=0.125 2023-06-26 04:56:24,589 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=1495782.0, ans=0.5 2023-06-26 04:56:39,197 INFO [train.py:996] (3/4) Epoch 9, batch 5350, loss[loss=0.262, simple_loss=0.3736, pruned_loss=0.07519, over 20722.00 frames. ], tot_loss[loss=0.2189, simple_loss=0.2986, pruned_loss=0.06962, over 4297374.72 frames. ], batch size: 607, lr: 3.35e-03, grad_scale: 16.0 2023-06-26 04:56:42,098 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.24 vs. limit=22.5 2023-06-26 04:57:06,597 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-26 04:57:15,458 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.489e+02 4.386e+02 5.571e+02 7.652e+02 1.743e+03, threshold=1.114e+03, percent-clipped=3.0 2023-06-26 04:57:25,173 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=1495962.0, ans=0.2 2023-06-26 04:57:39,348 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=1495962.0, ans=0.125 2023-06-26 04:58:19,503 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=1496082.0, ans=0.125 2023-06-26 04:58:24,598 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1496082.0, ans=0.0 2023-06-26 04:58:27,560 INFO [train.py:996] (3/4) Epoch 9, batch 5400, loss[loss=0.221, simple_loss=0.2818, pruned_loss=0.08011, over 21473.00 frames. ], tot_loss[loss=0.2192, simple_loss=0.2979, pruned_loss=0.07022, over 4294803.71 frames. ], batch size: 194, lr: 3.35e-03, grad_scale: 16.0 2023-06-26 04:58:36,284 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.84 vs. limit=15.0 2023-06-26 04:59:09,013 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1496262.0, ans=0.1 2023-06-26 04:59:26,098 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.11 vs. limit=6.0 2023-06-26 04:59:45,227 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.21 vs. 
limit=15.0 2023-06-26 05:00:00,672 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.10 vs. limit=6.0 2023-06-26 05:00:08,820 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=1496382.0, ans=0.0 2023-06-26 05:00:22,769 INFO [train.py:996] (3/4) Epoch 9, batch 5450, loss[loss=0.2417, simple_loss=0.337, pruned_loss=0.07322, over 21654.00 frames. ], tot_loss[loss=0.2199, simple_loss=0.2997, pruned_loss=0.07005, over 4297344.42 frames. ], batch size: 389, lr: 3.35e-03, grad_scale: 16.0 2023-06-26 05:00:25,034 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1496442.0, ans=0.125 2023-06-26 05:00:33,973 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1496442.0, ans=0.125 2023-06-26 05:00:54,978 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.911e+02 4.664e+02 7.291e+02 1.143e+03 2.963e+03, threshold=1.458e+03, percent-clipped=26.0 2023-06-26 05:02:12,195 INFO [train.py:996] (3/4) Epoch 9, batch 5500, loss[loss=0.1924, simple_loss=0.2877, pruned_loss=0.04851, over 21618.00 frames. ], tot_loss[loss=0.2187, simple_loss=0.3027, pruned_loss=0.06739, over 4285436.75 frames. ], batch size: 230, lr: 3.35e-03, grad_scale: 16.0 2023-06-26 05:02:13,314 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.20 vs. limit=15.0 2023-06-26 05:02:21,364 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1496742.0, ans=0.0 2023-06-26 05:04:02,037 INFO [train.py:996] (3/4) Epoch 9, batch 5550, loss[loss=0.192, simple_loss=0.2896, pruned_loss=0.04715, over 21577.00 frames. ], tot_loss[loss=0.2158, simple_loss=0.3025, pruned_loss=0.06451, over 4287813.25 frames. ], batch size: 441, lr: 3.35e-03, grad_scale: 16.0 2023-06-26 05:04:15,862 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=1497042.0, ans=0.125 2023-06-26 05:04:22,328 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=1497042.0, ans=0.125 2023-06-26 05:04:29,192 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=1497102.0, ans=0.125 2023-06-26 05:04:42,241 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.70 vs. 
limit=15.0 2023-06-26 05:04:44,534 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.199e+02 5.816e+02 9.061e+02 1.223e+03 2.185e+03, threshold=1.812e+03, percent-clipped=16.0 2023-06-26 05:04:54,551 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=1497162.0, ans=0.2 2023-06-26 05:05:25,282 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=1497222.0, ans=0.0 2023-06-26 05:05:55,543 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=1497282.0, ans=0.125 2023-06-26 05:05:57,613 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1497342.0, ans=0.125 2023-06-26 05:05:58,678 INFO [train.py:996] (3/4) Epoch 9, batch 5600, loss[loss=0.2315, simple_loss=0.323, pruned_loss=0.07, over 21611.00 frames. ], tot_loss[loss=0.2114, simple_loss=0.2988, pruned_loss=0.06201, over 4282181.16 frames. ], batch size: 263, lr: 3.35e-03, grad_scale: 32.0 2023-06-26 05:06:10,058 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1497342.0, ans=0.125 2023-06-26 05:07:25,003 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=1497582.0, ans=0.125 2023-06-26 05:07:45,527 INFO [train.py:996] (3/4) Epoch 9, batch 5650, loss[loss=0.2273, simple_loss=0.308, pruned_loss=0.07329, over 21910.00 frames. ], tot_loss[loss=0.2164, simple_loss=0.3025, pruned_loss=0.06515, over 4290456.04 frames. ], batch size: 107, lr: 3.35e-03, grad_scale: 16.0 2023-06-26 05:08:29,194 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.963e+02 5.175e+02 8.774e+02 1.262e+03 2.376e+03, threshold=1.755e+03, percent-clipped=8.0 2023-06-26 05:08:31,589 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1497762.0, ans=0.125 2023-06-26 05:08:31,598 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1497762.0, ans=0.1 2023-06-26 05:08:35,393 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1497762.0, ans=0.0 2023-06-26 05:09:03,267 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=5.42 vs. limit=10.0 2023-06-26 05:09:36,526 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1497882.0, ans=0.1 2023-06-26 05:09:36,595 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1497882.0, ans=0.1 2023-06-26 05:09:40,465 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1497942.0, ans=0.0 2023-06-26 05:09:41,564 INFO [train.py:996] (3/4) Epoch 9, batch 5700, loss[loss=0.1958, simple_loss=0.2609, pruned_loss=0.06529, over 20096.00 frames. ], tot_loss[loss=0.2168, simple_loss=0.3011, pruned_loss=0.06624, over 4284837.13 frames. 
], batch size: 702, lr: 3.35e-03, grad_scale: 16.0 2023-06-26 05:10:12,808 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1498002.0, ans=0.125 2023-06-26 05:10:26,699 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1498062.0, ans=0.125 2023-06-26 05:10:49,794 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=1498122.0, ans=0.125 2023-06-26 05:11:09,776 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=1498122.0, ans=0.125 2023-06-26 05:11:39,514 INFO [train.py:996] (3/4) Epoch 9, batch 5750, loss[loss=0.2005, simple_loss=0.2698, pruned_loss=0.06556, over 21213.00 frames. ], tot_loss[loss=0.2124, simple_loss=0.297, pruned_loss=0.06392, over 4286320.11 frames. ], batch size: 608, lr: 3.35e-03, grad_scale: 16.0 2023-06-26 05:12:18,971 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.313e+02 4.582e+02 6.982e+02 1.089e+03 2.466e+03, threshold=1.396e+03, percent-clipped=2.0 2023-06-26 05:12:19,529 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1498362.0, ans=0.125 2023-06-26 05:13:21,507 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1498482.0, ans=0.0 2023-06-26 05:13:24,705 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=1498482.0, ans=0.0 2023-06-26 05:13:28,333 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=1498482.0, ans=0.2 2023-06-26 05:13:31,257 INFO [train.py:996] (3/4) Epoch 9, batch 5800, loss[loss=0.2316, simple_loss=0.3425, pruned_loss=0.06034, over 20804.00 frames. ], tot_loss[loss=0.2102, simple_loss=0.2958, pruned_loss=0.06226, over 4283067.39 frames. ], batch size: 607, lr: 3.35e-03, grad_scale: 16.0 2023-06-26 05:14:04,291 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=1498602.0, ans=0.95 2023-06-26 05:14:32,678 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.max_abs, batch_count=1498662.0, ans=10.0 2023-06-26 05:14:54,996 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.35 vs. limit=15.0 2023-06-26 05:15:09,231 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1498782.0, ans=0.125 2023-06-26 05:15:12,215 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1498782.0, ans=0.125 2023-06-26 05:15:27,924 INFO [train.py:996] (3/4) Epoch 9, batch 5850, loss[loss=0.1681, simple_loss=0.2766, pruned_loss=0.02974, over 21800.00 frames. ], tot_loss[loss=0.2045, simple_loss=0.2929, pruned_loss=0.05809, over 4270041.06 frames. ], batch size: 282, lr: 3.35e-03, grad_scale: 16.0 2023-06-26 05:15:35,806 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=12.46 vs. 
limit=22.5 2023-06-26 05:16:05,869 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.859e+02 4.519e+02 6.797e+02 9.504e+02 2.240e+03, threshold=1.359e+03, percent-clipped=6.0 2023-06-26 05:16:37,866 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=1499022.0, ans=0.0 2023-06-26 05:17:15,208 INFO [train.py:996] (3/4) Epoch 9, batch 5900, loss[loss=0.1794, simple_loss=0.2755, pruned_loss=0.04167, over 21677.00 frames. ], tot_loss[loss=0.1976, simple_loss=0.2866, pruned_loss=0.05428, over 4265388.98 frames. ], batch size: 414, lr: 3.35e-03, grad_scale: 16.0 2023-06-26 05:17:15,633 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=1499142.0, ans=0.07 2023-06-26 05:18:33,083 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1499322.0, ans=0.125 2023-06-26 05:19:03,494 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.79 vs. limit=15.0 2023-06-26 05:19:04,244 INFO [train.py:996] (3/4) Epoch 9, batch 5950, loss[loss=0.1866, simple_loss=0.2646, pruned_loss=0.05433, over 21842.00 frames. ], tot_loss[loss=0.1993, simple_loss=0.2863, pruned_loss=0.05612, over 4267114.51 frames. ], batch size: 298, lr: 3.35e-03, grad_scale: 16.0 2023-06-26 05:19:12,652 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=5.97 vs. limit=15.0 2023-06-26 05:19:47,182 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.880e+02 4.429e+02 6.642e+02 9.511e+02 2.071e+03, threshold=1.328e+03, percent-clipped=8.0 2023-06-26 05:20:46,098 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=1499682.0, ans=0.0 2023-06-26 05:20:50,639 INFO [train.py:996] (3/4) Epoch 9, batch 6000, loss[loss=0.2125, simple_loss=0.2714, pruned_loss=0.07678, over 21508.00 frames. ], tot_loss[loss=0.2002, simple_loss=0.2824, pruned_loss=0.059, over 4261350.84 frames. ], batch size: 391, lr: 3.35e-03, grad_scale: 32.0 2023-06-26 05:20:50,639 INFO [train.py:1019] (3/4) Computing validation loss 2023-06-26 05:21:11,487 INFO [train.py:1028] (3/4) Epoch 9, validation: loss=0.2616, simple_loss=0.3531, pruned_loss=0.08508, over 1796401.00 frames. 
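A brief note on reading the loss figures in these entries: the per-batch loss values logged above are consistent with a 0.5-weighted simple loss plus the pruned loss (for example, Epoch 9, batch 6000: 0.5 * 0.2714 + 0.07678 is about 0.2125), and tot_loss behaves like a frame-weighted running average over the batches seen so far in the epoch. The short sketch below only illustrates that bookkeeping as inferred from the logged numbers; it is not the training code, and the names combined_loss, FrameWeightedAverage, and simple_scale are placeholders.

def combined_loss(simple_loss, pruned_loss, simple_scale=0.5):
    # Per-batch loss as it appears to be reported in these entries
    # (relationship inferred from the logged numbers, not from train.py).
    return simple_scale * simple_loss + pruned_loss

class FrameWeightedAverage:
    # Running average weighted by the frame count of each batch,
    # matching how tot_loss is reported "over N frames" in the log.
    def __init__(self):
        self.total = 0.0
        self.frames = 0.0
    def update(self, value, num_frames):
        self.total += value * num_frames
        self.frames += num_frames
        return self.total / self.frames

# Example with the figures from Epoch 9, batch 6000 above:
# 0.5 * 0.2714 + 0.07678 ~= 0.2125, matching the logged per-batch loss.
assert abs(combined_loss(0.2714, 0.07678) - 0.2125) < 1e-3

The same relation holds for the validation entry above (0.5 * 0.3531 + 0.08508 is about 0.2616), which is why only the combined loss, simple_loss, and pruned_loss are reported per batch.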
2023-06-26 05:21:11,488 INFO [train.py:1029] (3/4) Maximum memory allocated so far is 23690MB 2023-06-26 05:21:17,925 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=1499742.0, ans=0.035 2023-06-26 05:21:21,720 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1499742.0, ans=0.1 2023-06-26 05:21:24,843 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1499742.0, ans=0.125 2023-06-26 05:21:34,099 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1499802.0, ans=0.125 2023-06-26 05:22:07,427 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1499862.0, ans=0.0 2023-06-26 05:22:22,093 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1499922.0, ans=0.125 2023-06-26 05:22:34,248 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=1499922.0, ans=0.0 2023-06-26 05:22:39,701 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=1499982.0, ans=0.2 2023-06-26 05:22:40,274 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.01 vs. limit=10.0 2023-06-26 05:23:03,994 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=1499982.0, ans=0.0 2023-06-26 05:23:08,667 INFO [train.py:996] (3/4) Epoch 9, batch 6050, loss[loss=0.2176, simple_loss=0.2727, pruned_loss=0.08128, over 21206.00 frames. ], tot_loss[loss=0.1999, simple_loss=0.2779, pruned_loss=0.06093, over 4257552.64 frames. ], batch size: 159, lr: 3.35e-03, grad_scale: 16.0 2023-06-26 05:23:48,264 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.995e+02 4.915e+02 7.181e+02 1.064e+03 2.049e+03, threshold=1.436e+03, percent-clipped=12.0 2023-06-26 05:24:16,935 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1500222.0, ans=0.1 2023-06-26 05:24:20,040 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=1500222.0, ans=0.0 2023-06-26 05:24:55,881 INFO [train.py:996] (3/4) Epoch 9, batch 6100, loss[loss=0.2192, simple_loss=0.2866, pruned_loss=0.07589, over 21324.00 frames. ], tot_loss[loss=0.1969, simple_loss=0.276, pruned_loss=0.05895, over 4262628.99 frames. ], batch size: 159, lr: 3.34e-03, grad_scale: 16.0 2023-06-26 05:25:10,565 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=1500342.0, ans=0.2 2023-06-26 05:25:16,049 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1500402.0, ans=0.125 2023-06-26 05:25:33,795 INFO [scaling.py:962] (3/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=5.34 vs. 
limit=8.0 2023-06-26 05:26:30,052 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1500582.0, ans=0.0 2023-06-26 05:26:43,521 INFO [train.py:996] (3/4) Epoch 9, batch 6150, loss[loss=0.1997, simple_loss=0.2809, pruned_loss=0.0593, over 21415.00 frames. ], tot_loss[loss=0.2017, simple_loss=0.2793, pruned_loss=0.062, over 4263997.09 frames. ], batch size: 212, lr: 3.34e-03, grad_scale: 16.0 2023-06-26 05:26:51,418 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1500642.0, ans=0.125 2023-06-26 05:26:57,908 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1500642.0, ans=0.0 2023-06-26 05:27:22,650 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.91 vs. limit=10.0 2023-06-26 05:27:23,503 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.389e+02 4.733e+02 6.899e+02 9.489e+02 3.075e+03, threshold=1.380e+03, percent-clipped=10.0 2023-06-26 05:27:24,496 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1500762.0, ans=0.1 2023-06-26 05:27:27,722 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1500762.0, ans=0.125 2023-06-26 05:27:34,662 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1500762.0, ans=0.0 2023-06-26 05:27:48,764 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=1500822.0, ans=0.2 2023-06-26 05:28:29,282 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1500882.0, ans=0.0 2023-06-26 05:28:32,068 INFO [train.py:996] (3/4) Epoch 9, batch 6200, loss[loss=0.2059, simple_loss=0.2766, pruned_loss=0.06758, over 21392.00 frames. ], tot_loss[loss=0.2047, simple_loss=0.2829, pruned_loss=0.06325, over 4270450.80 frames. ], batch size: 144, lr: 3.34e-03, grad_scale: 16.0 2023-06-26 05:28:38,717 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.97 vs. limit=22.5 2023-06-26 05:28:47,511 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.93 vs. limit=15.0 2023-06-26 05:28:48,535 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1501002.0, ans=0.1 2023-06-26 05:30:14,790 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=1501182.0, ans=10.0 2023-06-26 05:30:21,386 INFO [train.py:996] (3/4) Epoch 9, batch 6250, loss[loss=0.1485, simple_loss=0.22, pruned_loss=0.03846, over 17026.00 frames. ], tot_loss[loss=0.2088, simple_loss=0.2904, pruned_loss=0.06362, over 4271324.86 frames. ], batch size: 66, lr: 3.34e-03, grad_scale: 16.0 2023-06-26 05:30:34,404 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=2.51 vs. 
limit=15.0 2023-06-26 05:31:01,153 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.531e+02 5.718e+02 9.151e+02 1.565e+03 3.193e+03, threshold=1.830e+03, percent-clipped=32.0 2023-06-26 05:31:22,168 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=1501362.0, ans=0.0 2023-06-26 05:31:58,299 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=1501482.0, ans=0.125 2023-06-26 05:32:01,827 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1501482.0, ans=0.125 2023-06-26 05:32:09,864 INFO [train.py:996] (3/4) Epoch 9, batch 6300, loss[loss=0.2626, simple_loss=0.3239, pruned_loss=0.1007, over 21721.00 frames. ], tot_loss[loss=0.2105, simple_loss=0.2945, pruned_loss=0.0633, over 4275117.92 frames. ], batch size: 507, lr: 3.34e-03, grad_scale: 16.0 2023-06-26 05:33:00,261 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1501662.0, ans=0.125 2023-06-26 05:33:55,976 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.04 vs. limit=15.0 2023-06-26 05:34:00,240 INFO [train.py:996] (3/4) Epoch 9, batch 6350, loss[loss=0.2454, simple_loss=0.3055, pruned_loss=0.09262, over 21365.00 frames. ], tot_loss[loss=0.2163, simple_loss=0.298, pruned_loss=0.06737, over 4280078.83 frames. ], batch size: 548, lr: 3.34e-03, grad_scale: 16.0 2023-06-26 05:34:01,069 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1501842.0, ans=0.125 2023-06-26 05:34:11,316 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=1501842.0, ans=0.07 2023-06-26 05:34:52,165 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.804e+02 5.467e+02 7.732e+02 1.098e+03 2.787e+03, threshold=1.546e+03, percent-clipped=5.0 2023-06-26 05:35:00,038 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.51 vs. limit=15.0 2023-06-26 05:35:08,754 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.46 vs. limit=15.0 2023-06-26 05:35:37,560 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=9.95 vs. limit=15.0 2023-06-26 05:35:55,800 INFO [train.py:996] (3/4) Epoch 9, batch 6400, loss[loss=0.2637, simple_loss=0.336, pruned_loss=0.0957, over 21779.00 frames. ], tot_loss[loss=0.2208, simple_loss=0.3018, pruned_loss=0.06984, over 4281305.78 frames. 
], batch size: 441, lr: 3.34e-03, grad_scale: 32.0 2023-06-26 05:35:56,662 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=1502142.0, ans=0.2 2023-06-26 05:36:40,480 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=1502202.0, ans=0.125 2023-06-26 05:36:49,124 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1502262.0, ans=0.1 2023-06-26 05:37:01,517 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1502262.0, ans=0.0 2023-06-26 05:37:45,708 INFO [train.py:996] (3/4) Epoch 9, batch 6450, loss[loss=0.1819, simple_loss=0.2613, pruned_loss=0.05124, over 21436.00 frames. ], tot_loss[loss=0.2198, simple_loss=0.3026, pruned_loss=0.0685, over 4284313.93 frames. ], batch size: 131, lr: 3.34e-03, grad_scale: 16.0 2023-06-26 05:38:33,803 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.613e+02 5.276e+02 6.947e+02 1.153e+03 2.587e+03, threshold=1.389e+03, percent-clipped=9.0 2023-06-26 05:38:48,105 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=1502562.0, ans=0.2 2023-06-26 05:39:35,596 INFO [train.py:996] (3/4) Epoch 9, batch 6500, loss[loss=0.1925, simple_loss=0.2945, pruned_loss=0.04528, over 21781.00 frames. ], tot_loss[loss=0.2161, simple_loss=0.2969, pruned_loss=0.06759, over 4279717.61 frames. ], batch size: 351, lr: 3.34e-03, grad_scale: 16.0 2023-06-26 05:39:53,079 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.46 vs. limit=6.0 2023-06-26 05:40:01,314 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=1502802.0, ans=0.2 2023-06-26 05:40:23,986 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.53 vs. limit=15.0 2023-06-26 05:40:49,329 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1502922.0, ans=0.125 2023-06-26 05:41:30,795 INFO [train.py:996] (3/4) Epoch 9, batch 6550, loss[loss=0.2013, simple_loss=0.2767, pruned_loss=0.06295, over 21602.00 frames. ], tot_loss[loss=0.2137, simple_loss=0.2945, pruned_loss=0.06645, over 4270284.83 frames. ], batch size: 230, lr: 3.34e-03, grad_scale: 16.0 2023-06-26 05:41:44,944 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-26 05:42:19,565 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.317e+02 4.890e+02 6.578e+02 1.052e+03 2.225e+03, threshold=1.316e+03, percent-clipped=12.0 2023-06-26 05:42:51,022 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1503222.0, ans=0.125 2023-06-26 05:42:58,087 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.56 vs. limit=15.0 2023-06-26 05:43:12,657 INFO [train.py:996] (3/4) Epoch 9, batch 6600, loss[loss=0.19, simple_loss=0.2625, pruned_loss=0.05874, over 21623.00 frames. ], tot_loss[loss=0.2102, simple_loss=0.2888, pruned_loss=0.06582, over 4269701.47 frames. 
], batch size: 298, lr: 3.34e-03, grad_scale: 8.0 2023-06-26 05:43:58,058 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1503462.0, ans=0.0 2023-06-26 05:44:15,167 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1503462.0, ans=0.125 2023-06-26 05:44:16,719 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-26 05:44:50,715 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer_ff2.min_abs, batch_count=1503582.0, ans=0.1 2023-06-26 05:44:50,825 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-26 05:45:04,847 INFO [train.py:996] (3/4) Epoch 9, batch 6650, loss[loss=0.2091, simple_loss=0.2705, pruned_loss=0.07386, over 21558.00 frames. ], tot_loss[loss=0.2052, simple_loss=0.2834, pruned_loss=0.06352, over 4267459.34 frames. ], batch size: 391, lr: 3.34e-03, grad_scale: 8.0 2023-06-26 05:45:35,237 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1503702.0, ans=0.1 2023-06-26 05:45:41,324 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.11 vs. limit=15.0 2023-06-26 05:45:53,449 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.932e+02 4.741e+02 6.331e+02 9.151e+02 2.148e+03, threshold=1.266e+03, percent-clipped=9.0 2023-06-26 05:46:25,266 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1503822.0, ans=0.1 2023-06-26 05:46:31,785 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=7.54 vs. limit=15.0 2023-06-26 05:46:54,058 INFO [train.py:996] (3/4) Epoch 9, batch 6700, loss[loss=0.1785, simple_loss=0.2482, pruned_loss=0.05444, over 21749.00 frames. ], tot_loss[loss=0.2017, simple_loss=0.2776, pruned_loss=0.06297, over 4265290.52 frames. ], batch size: 112, lr: 3.34e-03, grad_scale: 8.0 2023-06-26 05:47:01,244 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=1503942.0, ans=0.125 2023-06-26 05:47:19,160 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-26 05:47:53,490 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1504062.0, ans=0.125 2023-06-26 05:48:17,686 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=1504182.0, ans=0.0 2023-06-26 05:48:36,340 INFO [train.py:996] (3/4) Epoch 9, batch 6750, loss[loss=0.2138, simple_loss=0.2915, pruned_loss=0.068, over 21917.00 frames. ], tot_loss[loss=0.2022, simple_loss=0.2775, pruned_loss=0.06351, over 4267132.41 frames. 
], batch size: 124, lr: 3.34e-03, grad_scale: 8.0 2023-06-26 05:49:12,629 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1504302.0, ans=0.125 2023-06-26 05:49:29,211 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.15 vs. limit=15.0 2023-06-26 05:49:31,079 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.439e+02 4.588e+02 6.610e+02 8.394e+02 1.640e+03, threshold=1.322e+03, percent-clipped=2.0 2023-06-26 05:49:47,718 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.69 vs. limit=22.5 2023-06-26 05:50:23,909 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.44 vs. limit=15.0 2023-06-26 05:50:29,548 INFO [train.py:996] (3/4) Epoch 9, batch 6800, loss[loss=0.2243, simple_loss=0.2957, pruned_loss=0.07643, over 15563.00 frames. ], tot_loss[loss=0.2053, simple_loss=0.2797, pruned_loss=0.06544, over 4261509.97 frames. ], batch size: 64, lr: 3.34e-03, grad_scale: 16.0 2023-06-26 05:52:06,264 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1504782.0, ans=0.0 2023-06-26 05:52:16,593 INFO [train.py:996] (3/4) Epoch 9, batch 6850, loss[loss=0.2076, simple_loss=0.2722, pruned_loss=0.07149, over 21259.00 frames. ], tot_loss[loss=0.206, simple_loss=0.2783, pruned_loss=0.06688, over 4255277.80 frames. ], batch size: 159, lr: 3.34e-03, grad_scale: 16.0 2023-06-26 05:53:05,512 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.496e+02 4.685e+02 7.280e+02 1.216e+03 2.418e+03, threshold=1.456e+03, percent-clipped=17.0 2023-06-26 05:54:03,041 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1505082.0, ans=0.1 2023-06-26 05:54:03,145 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=1505082.0, ans=0.0 2023-06-26 05:54:05,600 INFO [train.py:996] (3/4) Epoch 9, batch 6900, loss[loss=0.21, simple_loss=0.2983, pruned_loss=0.06089, over 21742.00 frames. ], tot_loss[loss=0.2063, simple_loss=0.2787, pruned_loss=0.06689, over 4265399.58 frames. ], batch size: 414, lr: 3.34e-03, grad_scale: 16.0 2023-06-26 05:54:09,594 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1505142.0, ans=0.0 2023-06-26 05:54:18,853 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1505142.0, ans=0.125 2023-06-26 05:54:34,591 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1505202.0, ans=0.1 2023-06-26 05:54:36,765 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1505202.0, ans=0.125 2023-06-26 05:55:54,117 INFO [train.py:996] (3/4) Epoch 9, batch 6950, loss[loss=0.2218, simple_loss=0.2957, pruned_loss=0.07388, over 21488.00 frames. ], tot_loss[loss=0.2055, simple_loss=0.2806, pruned_loss=0.06518, over 4266405.68 frames. 
], batch size: 194, lr: 3.34e-03, grad_scale: 16.0 2023-06-26 05:55:55,431 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=11.08 vs. limit=22.5 2023-06-26 05:56:12,212 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=1505442.0, ans=0.0 2023-06-26 05:56:29,910 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=1505502.0, ans=0.0 2023-06-26 05:56:34,927 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1505502.0, ans=0.1 2023-06-26 05:56:43,257 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.242e+02 5.032e+02 6.537e+02 9.718e+02 2.265e+03, threshold=1.307e+03, percent-clipped=8.0 2023-06-26 05:57:08,351 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.22 vs. limit=15.0 2023-06-26 05:57:14,156 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=12.01 vs. limit=15.0 2023-06-26 05:57:27,675 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1505682.0, ans=0.125 2023-06-26 05:57:42,921 INFO [train.py:996] (3/4) Epoch 9, batch 7000, loss[loss=0.2291, simple_loss=0.2932, pruned_loss=0.08252, over 21475.00 frames. ], tot_loss[loss=0.2088, simple_loss=0.2827, pruned_loss=0.0675, over 4265675.15 frames. ], batch size: 389, lr: 3.34e-03, grad_scale: 16.0 2023-06-26 05:57:44,593 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=9.77 vs. limit=15.0 2023-06-26 05:58:43,793 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=1505862.0, ans=0.05 2023-06-26 05:58:44,217 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.19 vs. limit=22.5 2023-06-26 05:58:49,172 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=1505862.0, ans=0.04949747468305833 2023-06-26 05:59:24,862 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=1505982.0, ans=0.025 2023-06-26 05:59:24,864 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1505982.0, ans=0.125 2023-06-26 05:59:35,589 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1505982.0, ans=0.0 2023-06-26 05:59:38,663 INFO [train.py:996] (3/4) Epoch 9, batch 7050, loss[loss=0.2123, simple_loss=0.2984, pruned_loss=0.06305, over 21606.00 frames. ], tot_loss[loss=0.207, simple_loss=0.2818, pruned_loss=0.06612, over 4261861.47 frames. 
], batch size: 414, lr: 3.34e-03, grad_scale: 16.0 2023-06-26 06:00:11,061 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1506102.0, ans=0.125 2023-06-26 06:00:27,418 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.188e+02 4.829e+02 6.611e+02 8.594e+02 1.864e+03, threshold=1.322e+03, percent-clipped=11.0 2023-06-26 06:00:44,193 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1506222.0, ans=0.1 2023-06-26 06:01:07,379 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1506282.0, ans=0.1 2023-06-26 06:01:33,081 INFO [train.py:996] (3/4) Epoch 9, batch 7100, loss[loss=0.1637, simple_loss=0.2432, pruned_loss=0.04211, over 21351.00 frames. ], tot_loss[loss=0.2108, simple_loss=0.2867, pruned_loss=0.06749, over 4264589.26 frames. ], batch size: 194, lr: 3.34e-03, grad_scale: 16.0 2023-06-26 06:02:14,316 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.72 vs. limit=15.0 2023-06-26 06:02:14,412 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.18 vs. limit=6.0 2023-06-26 06:02:42,440 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=1506522.0, ans=0.0 2023-06-26 06:02:52,045 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1506522.0, ans=0.0 2023-06-26 06:03:01,726 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=1506582.0, ans=0.125 2023-06-26 06:03:22,347 INFO [train.py:996] (3/4) Epoch 9, batch 7150, loss[loss=0.2115, simple_loss=0.2939, pruned_loss=0.06462, over 21769.00 frames. ], tot_loss[loss=0.2072, simple_loss=0.2849, pruned_loss=0.06474, over 4256298.40 frames. ], batch size: 298, lr: 3.34e-03, grad_scale: 16.0 2023-06-26 06:03:28,351 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1506642.0, ans=0.0 2023-06-26 06:04:06,074 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.994e+02 4.588e+02 6.424e+02 8.469e+02 2.110e+03, threshold=1.285e+03, percent-clipped=2.0 2023-06-26 06:05:11,694 INFO [train.py:996] (3/4) Epoch 9, batch 7200, loss[loss=0.1985, simple_loss=0.259, pruned_loss=0.069, over 21227.00 frames. ], tot_loss[loss=0.2101, simple_loss=0.2873, pruned_loss=0.06643, over 4257059.11 frames. 
], batch size: 549, lr: 3.34e-03, grad_scale: 32.0 2023-06-26 06:05:17,878 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1506942.0, ans=0.125 2023-06-26 06:05:41,900 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-26 06:05:49,672 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1507002.0, ans=0.0 2023-06-26 06:05:54,201 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1507062.0, ans=0.0 2023-06-26 06:06:29,791 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1507122.0, ans=0.1 2023-06-26 06:06:31,290 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer_ff3.min_abs, batch_count=1507122.0, ans=0.2 2023-06-26 06:07:00,473 INFO [train.py:996] (3/4) Epoch 9, batch 7250, loss[loss=0.2245, simple_loss=0.2783, pruned_loss=0.08538, over 21393.00 frames. ], tot_loss[loss=0.2081, simple_loss=0.2823, pruned_loss=0.06696, over 4263218.50 frames. ], batch size: 475, lr: 3.34e-03, grad_scale: 16.0 2023-06-26 06:07:20,204 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=1507242.0, ans=0.125 2023-06-26 06:07:45,477 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.193e+02 5.249e+02 7.377e+02 1.151e+03 2.707e+03, threshold=1.475e+03, percent-clipped=23.0 2023-06-26 06:08:00,600 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1507362.0, ans=0.125 2023-06-26 06:08:33,994 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.99 vs. limit=22.5 2023-06-26 06:08:40,023 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1507482.0, ans=0.125 2023-06-26 06:08:42,591 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.81 vs. limit=6.0 2023-06-26 06:08:48,791 INFO [train.py:996] (3/4) Epoch 9, batch 7300, loss[loss=0.2045, simple_loss=0.2731, pruned_loss=0.06801, over 15782.00 frames. ], tot_loss[loss=0.2056, simple_loss=0.2774, pruned_loss=0.06687, over 4255073.34 frames. 
], batch size: 60, lr: 3.34e-03, grad_scale: 16.0 2023-06-26 06:08:52,627 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1507542.0, ans=0.125 2023-06-26 06:09:24,396 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1507602.0, ans=0.125 2023-06-26 06:09:44,361 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=1507662.0, ans=0.07 2023-06-26 06:09:47,842 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=1507662.0, ans=0.125 2023-06-26 06:09:59,411 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=1507722.0, ans=0.125 2023-06-26 06:09:59,943 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.10 vs. limit=22.5 2023-06-26 06:10:44,112 INFO [train.py:996] (3/4) Epoch 9, batch 7350, loss[loss=0.2804, simple_loss=0.3357, pruned_loss=0.1125, over 21413.00 frames. ], tot_loss[loss=0.2053, simple_loss=0.2763, pruned_loss=0.06718, over 4254797.28 frames. ], batch size: 471, lr: 3.34e-03, grad_scale: 16.0 2023-06-26 06:10:51,766 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1507842.0, ans=0.125 2023-06-26 06:11:02,145 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1507902.0, ans=0.1 2023-06-26 06:11:30,326 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.284e+02 4.727e+02 6.627e+02 9.690e+02 1.819e+03, threshold=1.325e+03, percent-clipped=8.0 2023-06-26 06:11:59,716 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1508022.0, ans=0.125 2023-06-26 06:12:10,615 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=1508082.0, ans=0.125 2023-06-26 06:12:34,183 INFO [train.py:996] (3/4) Epoch 9, batch 7400, loss[loss=0.1878, simple_loss=0.271, pruned_loss=0.05227, over 20784.00 frames. ], tot_loss[loss=0.21, simple_loss=0.2825, pruned_loss=0.06874, over 4251926.48 frames. ], batch size: 607, lr: 3.34e-03, grad_scale: 16.0 2023-06-26 06:13:50,607 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.93 vs. limit=15.0 2023-06-26 06:13:57,097 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1508322.0, ans=0.0 2023-06-26 06:13:58,798 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=1508322.0, ans=0.125 2023-06-26 06:14:03,091 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.06 vs. limit=10.0 2023-06-26 06:14:04,587 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.02 vs. 
limit=6.0 2023-06-26 06:14:25,303 INFO [train.py:996] (3/4) Epoch 9, batch 7450, loss[loss=0.2037, simple_loss=0.2662, pruned_loss=0.07059, over 21288.00 frames. ], tot_loss[loss=0.2092, simple_loss=0.2819, pruned_loss=0.06823, over 4256075.67 frames. ], batch size: 159, lr: 3.34e-03, grad_scale: 16.0 2023-06-26 06:14:36,630 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1508442.0, ans=0.0 2023-06-26 06:15:09,634 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1508502.0, ans=0.1 2023-06-26 06:15:23,601 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.343e+02 4.976e+02 6.577e+02 1.050e+03 2.324e+03, threshold=1.315e+03, percent-clipped=17.0 2023-06-26 06:16:07,885 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=1508682.0, ans=0.0 2023-06-26 06:16:10,074 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.25 vs. limit=15.0 2023-06-26 06:16:18,122 INFO [train.py:996] (3/4) Epoch 9, batch 7500, loss[loss=0.2217, simple_loss=0.3152, pruned_loss=0.06412, over 21366.00 frames. ], tot_loss[loss=0.2142, simple_loss=0.2898, pruned_loss=0.06931, over 4254631.74 frames. ], batch size: 176, lr: 3.34e-03, grad_scale: 16.0 2023-06-26 06:18:08,917 INFO [train.py:996] (3/4) Epoch 9, batch 7550, loss[loss=0.2054, simple_loss=0.3032, pruned_loss=0.05374, over 21446.00 frames. ], tot_loss[loss=0.2172, simple_loss=0.2968, pruned_loss=0.06875, over 4260602.95 frames. ], batch size: 211, lr: 3.34e-03, grad_scale: 16.0 2023-06-26 06:19:04,850 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.393e+02 6.002e+02 8.588e+02 1.350e+03 2.877e+03, threshold=1.718e+03, percent-clipped=25.0 2023-06-26 06:19:56,675 INFO [train.py:996] (3/4) Epoch 9, batch 7600, loss[loss=0.1951, simple_loss=0.282, pruned_loss=0.05407, over 21646.00 frames. ], tot_loss[loss=0.2151, simple_loss=0.2946, pruned_loss=0.06778, over 4272293.76 frames. ], batch size: 263, lr: 3.33e-03, grad_scale: 32.0 2023-06-26 06:21:07,056 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=4.12 vs. limit=15.0 2023-06-26 06:21:31,110 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.78 vs. limit=15.0 2023-06-26 06:21:44,895 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1509642.0, ans=0.125 2023-06-26 06:21:46,193 INFO [train.py:996] (3/4) Epoch 9, batch 7650, loss[loss=0.2234, simple_loss=0.2957, pruned_loss=0.07555, over 21365.00 frames. ], tot_loss[loss=0.2153, simple_loss=0.2925, pruned_loss=0.069, over 4278589.73 frames. 
], batch size: 159, lr: 3.33e-03, grad_scale: 16.0 2023-06-26 06:21:46,908 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1509642.0, ans=0.1 2023-06-26 06:22:33,783 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=1509702.0, ans=0.2 2023-06-26 06:22:43,947 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.446e+02 5.104e+02 7.952e+02 1.146e+03 1.972e+03, threshold=1.590e+03, percent-clipped=6.0 2023-06-26 06:23:04,536 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=1509822.0, ans=0.0 2023-06-26 06:23:05,139 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.57 vs. limit=15.0 2023-06-26 06:23:14,739 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=1509822.0, ans=0.2 2023-06-26 06:23:19,206 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.74 vs. limit=15.0 2023-06-26 06:23:41,253 INFO [train.py:996] (3/4) Epoch 9, batch 7700, loss[loss=0.3188, simple_loss=0.3636, pruned_loss=0.137, over 21492.00 frames. ], tot_loss[loss=0.219, simple_loss=0.2952, pruned_loss=0.07139, over 4275148.42 frames. ], batch size: 510, lr: 3.33e-03, grad_scale: 16.0 2023-06-26 06:25:05,392 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=10.95 vs. limit=15.0 2023-06-26 06:25:33,210 INFO [train.py:996] (3/4) Epoch 9, batch 7750, loss[loss=0.243, simple_loss=0.3374, pruned_loss=0.0743, over 21776.00 frames. ], tot_loss[loss=0.2216, simple_loss=0.3007, pruned_loss=0.07128, over 4274351.75 frames. ], batch size: 282, lr: 3.33e-03, grad_scale: 16.0 2023-06-26 06:26:09,486 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=1510302.0, ans=0.125 2023-06-26 06:26:32,281 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.490e+02 5.408e+02 8.578e+02 1.362e+03 2.742e+03, threshold=1.716e+03, percent-clipped=14.0 2023-06-26 06:26:49,493 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=7.81 vs. limit=15.0 2023-06-26 06:27:34,362 INFO [train.py:996] (3/4) Epoch 9, batch 7800, loss[loss=0.2031, simple_loss=0.2882, pruned_loss=0.05905, over 21812.00 frames. ], tot_loss[loss=0.2231, simple_loss=0.302, pruned_loss=0.07211, over 4274811.83 frames. 
], batch size: 333, lr: 3.33e-03, grad_scale: 16.0 2023-06-26 06:27:36,862 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1510542.0, ans=0.0 2023-06-26 06:27:42,031 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1510542.0, ans=0.0 2023-06-26 06:27:52,576 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1510602.0, ans=0.0 2023-06-26 06:28:47,591 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1510722.0, ans=0.1 2023-06-26 06:29:23,902 INFO [train.py:996] (3/4) Epoch 9, batch 7850, loss[loss=0.1859, simple_loss=0.2558, pruned_loss=0.05803, over 21768.00 frames. ], tot_loss[loss=0.2176, simple_loss=0.2939, pruned_loss=0.07065, over 4267900.93 frames. ], batch size: 317, lr: 3.33e-03, grad_scale: 8.0 2023-06-26 06:29:57,689 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=1510902.0, ans=0.125 2023-06-26 06:30:12,954 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.204e+02 4.902e+02 7.462e+02 1.114e+03 2.139e+03, threshold=1.492e+03, percent-clipped=5.0 2023-06-26 06:30:36,011 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1511022.0, ans=0.125 2023-06-26 06:31:14,989 INFO [train.py:996] (3/4) Epoch 9, batch 7900, loss[loss=0.2089, simple_loss=0.2995, pruned_loss=0.05917, over 21611.00 frames. ], tot_loss[loss=0.2146, simple_loss=0.2903, pruned_loss=0.06944, over 4270506.93 frames. ], batch size: 263, lr: 3.33e-03, grad_scale: 8.0 2023-06-26 06:32:01,337 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1511262.0, ans=0.125 2023-06-26 06:33:07,268 INFO [train.py:996] (3/4) Epoch 9, batch 7950, loss[loss=0.2261, simple_loss=0.3027, pruned_loss=0.07475, over 21146.00 frames. ], tot_loss[loss=0.2165, simple_loss=0.2945, pruned_loss=0.06927, over 4265911.68 frames. ], batch size: 143, lr: 3.33e-03, grad_scale: 8.0 2023-06-26 06:33:21,663 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.29 vs. limit=6.0 2023-06-26 06:33:22,618 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=1511442.0, ans=0.2 2023-06-26 06:34:02,987 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.829e+02 6.422e+02 9.281e+02 1.330e+03 3.368e+03, threshold=1.856e+03, percent-clipped=18.0 2023-06-26 06:35:05,370 INFO [train.py:996] (3/4) Epoch 9, batch 8000, loss[loss=0.2438, simple_loss=0.3343, pruned_loss=0.07666, over 21648.00 frames. ], tot_loss[loss=0.2207, simple_loss=0.2991, pruned_loss=0.07121, over 4264472.54 frames. ], batch size: 389, lr: 3.33e-03, grad_scale: 16.0 2023-06-26 06:35:17,580 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1511742.0, ans=0.0 2023-06-26 06:36:27,618 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=10.84 vs. 
limit=15.0 2023-06-26 06:36:47,313 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.29 vs. limit=15.0 2023-06-26 06:37:01,463 INFO [train.py:996] (3/4) Epoch 9, batch 8050, loss[loss=0.2657, simple_loss=0.3515, pruned_loss=0.08999, over 21655.00 frames. ], tot_loss[loss=0.2215, simple_loss=0.3006, pruned_loss=0.07115, over 4262720.83 frames. ], batch size: 389, lr: 3.33e-03, grad_scale: 16.0 2023-06-26 06:37:03,971 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1512042.0, ans=0.2 2023-06-26 06:37:27,855 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1512102.0, ans=0.125 2023-06-26 06:38:01,841 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.472e+02 6.276e+02 8.546e+02 1.348e+03 3.651e+03, threshold=1.709e+03, percent-clipped=15.0 2023-06-26 06:38:16,973 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.68 vs. limit=15.0 2023-06-26 06:38:43,500 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.65 vs. limit=15.0 2023-06-26 06:38:51,609 INFO [train.py:996] (3/4) Epoch 9, batch 8100, loss[loss=0.1701, simple_loss=0.2214, pruned_loss=0.05941, over 20773.00 frames. ], tot_loss[loss=0.2208, simple_loss=0.2982, pruned_loss=0.07168, over 4262978.17 frames. ], batch size: 609, lr: 3.33e-03, grad_scale: 16.0 2023-06-26 06:39:53,718 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1512462.0, ans=0.125 2023-06-26 06:40:58,166 INFO [train.py:996] (3/4) Epoch 9, batch 8150, loss[loss=0.2055, simple_loss=0.3092, pruned_loss=0.05091, over 20833.00 frames. ], tot_loss[loss=0.2261, simple_loss=0.3059, pruned_loss=0.07315, over 4267186.11 frames. ], batch size: 609, lr: 3.33e-03, grad_scale: 8.0 2023-06-26 06:41:14,064 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=6.98 vs. limit=22.5 2023-06-26 06:41:40,713 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1512702.0, ans=0.1 2023-06-26 06:41:54,140 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.538e+02 6.819e+02 1.034e+03 1.568e+03 4.387e+03, threshold=2.069e+03, percent-clipped=18.0 2023-06-26 06:42:07,189 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=1512822.0, ans=0.2 2023-06-26 06:42:26,351 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=1512882.0, ans=0.07 2023-06-26 06:42:43,530 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.36 vs. limit=6.0 2023-06-26 06:42:49,123 INFO [train.py:996] (3/4) Epoch 9, batch 8200, loss[loss=0.1907, simple_loss=0.2592, pruned_loss=0.06113, over 20820.00 frames. ], tot_loss[loss=0.22, simple_loss=0.2982, pruned_loss=0.07092, over 4268333.57 frames. 
], batch size: 609, lr: 3.33e-03, grad_scale: 8.0 2023-06-26 06:44:40,642 INFO [train.py:996] (3/4) Epoch 9, batch 8250, loss[loss=0.2531, simple_loss=0.3742, pruned_loss=0.06597, over 20767.00 frames. ], tot_loss[loss=0.2196, simple_loss=0.2983, pruned_loss=0.07042, over 4267014.94 frames. ], batch size: 607, lr: 3.33e-03, grad_scale: 8.0 2023-06-26 06:45:36,648 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.396e+02 4.867e+02 7.289e+02 1.042e+03 1.970e+03, threshold=1.458e+03, percent-clipped=0.0 2023-06-26 06:45:56,853 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=12.33 vs. limit=22.5 2023-06-26 06:46:04,607 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1513422.0, ans=0.125 2023-06-26 06:46:15,206 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=1513482.0, ans=0.0 2023-06-26 06:46:29,584 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=6.10 vs. limit=15.0 2023-06-26 06:46:35,300 INFO [train.py:996] (3/4) Epoch 9, batch 8300, loss[loss=0.1734, simple_loss=0.2513, pruned_loss=0.04771, over 21314.00 frames. ], tot_loss[loss=0.2161, simple_loss=0.2956, pruned_loss=0.0683, over 4266901.61 frames. ], batch size: 131, lr: 3.33e-03, grad_scale: 8.0 2023-06-26 06:46:35,842 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=1513542.0, ans=0.0 2023-06-26 06:46:53,097 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=1513602.0, ans=0.035 2023-06-26 06:47:09,056 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1513602.0, ans=0.125 2023-06-26 06:47:11,178 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.24 vs. limit=6.0 2023-06-26 06:47:49,461 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=1513722.0, ans=0.2 2023-06-26 06:47:50,013 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.43 vs. limit=15.0 2023-06-26 06:48:13,516 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1513782.0, ans=0.1 2023-06-26 06:48:15,876 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.53 vs. limit=15.0 2023-06-26 06:48:19,267 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=1513782.0, ans=0.0 2023-06-26 06:48:25,395 INFO [train.py:996] (3/4) Epoch 9, batch 8350, loss[loss=0.2078, simple_loss=0.2926, pruned_loss=0.06149, over 21482.00 frames. ], tot_loss[loss=0.2139, simple_loss=0.2942, pruned_loss=0.06676, over 4269093.19 frames. 
], batch size: 389, lr: 3.33e-03, grad_scale: 8.0 2023-06-26 06:49:22,434 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.156e+02 5.177e+02 7.489e+02 1.153e+03 2.858e+03, threshold=1.498e+03, percent-clipped=11.0 2023-06-26 06:50:05,260 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1514082.0, ans=0.125 2023-06-26 06:50:05,917 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=6.41 vs. limit=15.0 2023-06-26 06:50:08,281 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=1514082.0, ans=0.0 2023-06-26 06:50:14,409 INFO [train.py:996] (3/4) Epoch 9, batch 8400, loss[loss=0.1902, simple_loss=0.2775, pruned_loss=0.05148, over 21498.00 frames. ], tot_loss[loss=0.2096, simple_loss=0.2911, pruned_loss=0.06402, over 4274475.64 frames. ], batch size: 212, lr: 3.33e-03, grad_scale: 16.0 2023-06-26 06:50:46,779 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.73 vs. limit=15.0 2023-06-26 06:51:26,335 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-26 06:51:36,690 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1514322.0, ans=0.0 2023-06-26 06:51:47,019 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1514382.0, ans=0.1 2023-06-26 06:52:01,965 INFO [train.py:996] (3/4) Epoch 9, batch 8450, loss[loss=0.2212, simple_loss=0.2954, pruned_loss=0.07348, over 21862.00 frames. ], tot_loss[loss=0.208, simple_loss=0.2896, pruned_loss=0.06319, over 4282237.22 frames. ], batch size: 371, lr: 3.33e-03, grad_scale: 16.0 2023-06-26 06:52:05,742 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1514442.0, ans=0.125 2023-06-26 06:52:46,320 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=7.40 vs. limit=10.0 2023-06-26 06:52:58,302 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.897e+02 4.182e+02 5.654e+02 7.712e+02 3.428e+03, threshold=1.131e+03, percent-clipped=11.0 2023-06-26 06:53:51,648 INFO [train.py:996] (3/4) Epoch 9, batch 8500, loss[loss=0.1894, simple_loss=0.2655, pruned_loss=0.05663, over 21717.00 frames. ], tot_loss[loss=0.2076, simple_loss=0.2871, pruned_loss=0.06412, over 4282741.24 frames. ], batch size: 112, lr: 3.33e-03, grad_scale: 16.0 2023-06-26 06:53:56,468 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=14.59 vs. limit=15.0 2023-06-26 06:54:19,397 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=6.87 vs. 
limit=12.0 2023-06-26 06:54:20,663 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=1514802.0, ans=0.2 2023-06-26 06:54:57,887 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=1514922.0, ans=0.125 2023-06-26 06:55:31,198 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1514982.0, ans=0.125 2023-06-26 06:55:40,303 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=1514982.0, ans=0.125 2023-06-26 06:55:42,991 INFO [train.py:996] (3/4) Epoch 9, batch 8550, loss[loss=0.2722, simple_loss=0.3708, pruned_loss=0.08684, over 21254.00 frames. ], tot_loss[loss=0.2135, simple_loss=0.2921, pruned_loss=0.06739, over 4278706.33 frames. ], batch size: 548, lr: 3.33e-03, grad_scale: 16.0 2023-06-26 06:55:47,001 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=1515042.0, ans=0.125 2023-06-26 06:56:04,496 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1515102.0, ans=0.0 2023-06-26 06:56:20,567 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=1515102.0, ans=0.2 2023-06-26 06:56:40,472 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.388e+02 5.673e+02 9.028e+02 1.285e+03 2.973e+03, threshold=1.806e+03, percent-clipped=33.0 2023-06-26 06:56:41,210 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=1515162.0, ans=0.0 2023-06-26 06:57:05,074 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.02 vs. limit=15.0 2023-06-26 06:57:22,903 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.73 vs. limit=22.5 2023-06-26 06:57:26,492 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1515282.0, ans=0.125 2023-06-26 06:57:34,130 INFO [train.py:996] (3/4) Epoch 9, batch 8600, loss[loss=0.2087, simple_loss=0.3285, pruned_loss=0.04441, over 19813.00 frames. ], tot_loss[loss=0.2177, simple_loss=0.2979, pruned_loss=0.06881, over 4276359.23 frames. 
], batch size: 702, lr: 3.33e-03, grad_scale: 16.0 2023-06-26 06:57:56,574 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=1515402.0, ans=0.0 2023-06-26 06:58:03,774 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=1515402.0, ans=0.125 2023-06-26 06:58:07,251 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=1515402.0, ans=0.07 2023-06-26 06:58:19,013 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1515402.0, ans=0.0 2023-06-26 06:58:22,217 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1515462.0, ans=0.0 2023-06-26 06:59:10,535 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=1515582.0, ans=0.125 2023-06-26 06:59:25,165 INFO [train.py:996] (3/4) Epoch 9, batch 8650, loss[loss=0.2396, simple_loss=0.3288, pruned_loss=0.07526, over 21436.00 frames. ], tot_loss[loss=0.2205, simple_loss=0.3028, pruned_loss=0.06904, over 4275318.67 frames. ], batch size: 507, lr: 3.33e-03, grad_scale: 16.0 2023-06-26 06:59:28,958 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1515642.0, ans=0.125 2023-06-26 07:00:02,297 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=9.90 vs. limit=15.0 2023-06-26 07:00:10,080 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1515762.0, ans=0.125 2023-06-26 07:00:25,176 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.107e+02 4.849e+02 6.283e+02 8.957e+02 2.012e+03, threshold=1.257e+03, percent-clipped=3.0 2023-06-26 07:00:56,799 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1515882.0, ans=0.125 2023-06-26 07:01:11,966 INFO [train.py:996] (3/4) Epoch 9, batch 8700, loss[loss=0.2011, simple_loss=0.2688, pruned_loss=0.06673, over 21826.00 frames. ], tot_loss[loss=0.2143, simple_loss=0.2947, pruned_loss=0.06692, over 4264894.76 frames. ], batch size: 112, lr: 3.33e-03, grad_scale: 16.0 2023-06-26 07:01:16,435 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=5.34 vs. limit=12.0 2023-06-26 07:01:31,764 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=1516002.0, ans=0.07 2023-06-26 07:02:01,131 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1516062.0, ans=0.125 2023-06-26 07:02:54,669 INFO [train.py:996] (3/4) Epoch 9, batch 8750, loss[loss=0.2261, simple_loss=0.2847, pruned_loss=0.0837, over 20120.00 frames. ], tot_loss[loss=0.2132, simple_loss=0.2905, pruned_loss=0.06797, over 4265587.06 frames. 
], batch size: 703, lr: 3.33e-03, grad_scale: 16.0 2023-06-26 07:04:02,674 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.478e+02 4.871e+02 5.858e+02 9.020e+02 2.163e+03, threshold=1.172e+03, percent-clipped=9.0 2023-06-26 07:04:51,203 INFO [train.py:996] (3/4) Epoch 9, batch 8800, loss[loss=0.2436, simple_loss=0.3269, pruned_loss=0.0801, over 21681.00 frames. ], tot_loss[loss=0.2216, simple_loss=0.3005, pruned_loss=0.07129, over 4270612.28 frames. ], batch size: 351, lr: 3.33e-03, grad_scale: 32.0 2023-06-26 07:05:28,835 INFO [scaling.py:962] (3/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=3.78 vs. limit=5.0 2023-06-26 07:06:16,302 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.65 vs. limit=6.0 2023-06-26 07:06:30,194 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.76 vs. limit=22.5 2023-06-26 07:06:31,814 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.09 vs. limit=6.0 2023-06-26 07:06:46,515 INFO [train.py:996] (3/4) Epoch 9, batch 8850, loss[loss=0.2249, simple_loss=0.3224, pruned_loss=0.0637, over 20956.00 frames. ], tot_loss[loss=0.2273, simple_loss=0.3075, pruned_loss=0.07352, over 4278419.46 frames. ], batch size: 607, lr: 3.33e-03, grad_scale: 32.0 2023-06-26 07:07:29,975 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.18 vs. limit=22.5 2023-06-26 07:07:38,739 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=6.16 vs. limit=12.0 2023-06-26 07:07:43,090 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.458e+02 5.018e+02 7.490e+02 1.008e+03 2.036e+03, threshold=1.498e+03, percent-clipped=19.0 2023-06-26 07:07:47,647 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=14.51 vs. limit=15.0 2023-06-26 07:07:50,627 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=1517022.0, ans=0.0 2023-06-26 07:07:52,920 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.52 vs. limit=12.0 2023-06-26 07:07:56,227 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1517022.0, ans=0.125 2023-06-26 07:08:12,674 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=9.14 vs. limit=15.0 2023-06-26 07:08:35,870 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=1517142.0, ans=0.125 2023-06-26 07:08:37,019 INFO [train.py:996] (3/4) Epoch 9, batch 8900, loss[loss=0.1972, simple_loss=0.2708, pruned_loss=0.06183, over 21417.00 frames. ], tot_loss[loss=0.2238, simple_loss=0.3024, pruned_loss=0.07255, over 4279306.10 frames. 
], batch size: 194, lr: 3.33e-03, grad_scale: 16.0 2023-06-26 07:09:16,962 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.42 vs. limit=22.5 2023-06-26 07:09:38,445 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.45 vs. limit=15.0 2023-06-26 07:10:34,127 INFO [train.py:996] (3/4) Epoch 9, batch 8950, loss[loss=0.2561, simple_loss=0.3461, pruned_loss=0.08311, over 21624.00 frames. ], tot_loss[loss=0.2238, simple_loss=0.3039, pruned_loss=0.07189, over 4282329.74 frames. ], batch size: 441, lr: 3.33e-03, grad_scale: 16.0 2023-06-26 07:10:49,583 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.09 vs. limit=22.5 2023-06-26 07:11:03,552 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=1517502.0, ans=0.2 2023-06-26 07:11:31,713 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.717e+02 6.385e+02 1.007e+03 1.831e+03 3.231e+03, threshold=2.014e+03, percent-clipped=34.0 2023-06-26 07:11:34,665 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=1517562.0, ans=0.025 2023-06-26 07:11:55,295 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1517682.0, ans=0.125 2023-06-26 07:12:29,531 INFO [train.py:996] (3/4) Epoch 9, batch 9000, loss[loss=0.197, simple_loss=0.2642, pruned_loss=0.06497, over 21722.00 frames. ], tot_loss[loss=0.2191, simple_loss=0.2967, pruned_loss=0.0708, over 4283770.65 frames. ], batch size: 300, lr: 3.33e-03, grad_scale: 16.0 2023-06-26 07:12:29,532 INFO [train.py:1019] (3/4) Computing validation loss 2023-06-26 07:12:47,773 INFO [train.py:1028] (3/4) Epoch 9, validation: loss=0.2687, simple_loss=0.357, pruned_loss=0.09027, over 1796401.00 frames. 2023-06-26 07:12:47,773 INFO [train.py:1029] (3/4) Maximum memory allocated so far is 23690MB 2023-06-26 07:13:10,549 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.30 vs. limit=15.0 2023-06-26 07:13:17,837 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=6.40 vs. limit=15.0 2023-06-26 07:14:25,694 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1517982.0, ans=0.0 2023-06-26 07:14:38,784 INFO [train.py:996] (3/4) Epoch 9, batch 9050, loss[loss=0.1931, simple_loss=0.2762, pruned_loss=0.05504, over 21667.00 frames. ], tot_loss[loss=0.2147, simple_loss=0.2941, pruned_loss=0.06765, over 4276788.91 frames. 
], batch size: 298, lr: 3.33e-03, grad_scale: 16.0 2023-06-26 07:14:55,640 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=1518102.0, ans=0.2 2023-06-26 07:15:19,063 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1518162.0, ans=0.125 2023-06-26 07:15:38,266 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.242e+02 4.774e+02 6.783e+02 1.195e+03 2.023e+03, threshold=1.357e+03, percent-clipped=1.0 2023-06-26 07:16:30,155 INFO [train.py:996] (3/4) Epoch 9, batch 9100, loss[loss=0.2439, simple_loss=0.3372, pruned_loss=0.07529, over 21603.00 frames. ], tot_loss[loss=0.2207, simple_loss=0.3004, pruned_loss=0.07047, over 4279613.74 frames. ], batch size: 414, lr: 3.32e-03, grad_scale: 16.0 2023-06-26 07:16:57,800 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1518402.0, ans=0.1 2023-06-26 07:17:03,193 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=1518402.0, ans=0.0 2023-06-26 07:17:04,791 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1518402.0, ans=0.1 2023-06-26 07:17:32,042 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.38 vs. limit=22.5 2023-06-26 07:17:46,248 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-26 07:18:14,721 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1518582.0, ans=0.125 2023-06-26 07:18:20,684 INFO [train.py:996] (3/4) Epoch 9, batch 9150, loss[loss=0.2013, simple_loss=0.2939, pruned_loss=0.05436, over 21374.00 frames. ], tot_loss[loss=0.2198, simple_loss=0.303, pruned_loss=0.0683, over 4276319.20 frames. ], batch size: 194, lr: 3.32e-03, grad_scale: 16.0 2023-06-26 07:18:21,332 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1518642.0, ans=0.0 2023-06-26 07:18:21,374 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1518642.0, ans=0.125 2023-06-26 07:19:07,596 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=1518762.0, ans=0.125 2023-06-26 07:19:29,342 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.947e+02 4.682e+02 7.293e+02 9.875e+02 2.025e+03, threshold=1.459e+03, percent-clipped=11.0 2023-06-26 07:19:47,579 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1518822.0, ans=0.125 2023-06-26 07:19:54,281 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1518882.0, ans=0.125 2023-06-26 07:20:04,498 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=1518882.0, ans=0.0 2023-06-26 07:20:14,555 INFO [train.py:996] (3/4) Epoch 9, batch 9200, loss[loss=0.2814, simple_loss=0.3588, pruned_loss=0.102, over 21819.00 frames. ], tot_loss[loss=0.2205, simple_loss=0.3048, pruned_loss=0.06805, over 4283059.26 frames. 
], batch size: 118, lr: 3.32e-03, grad_scale: 32.0 2023-06-26 07:20:17,190 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-26 07:20:49,582 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1519002.0, ans=0.125 2023-06-26 07:21:21,032 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=1519062.0, ans=0.0 2023-06-26 07:21:26,214 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1519122.0, ans=0.125 2023-06-26 07:22:03,212 INFO [train.py:996] (3/4) Epoch 9, batch 9250, loss[loss=0.2402, simple_loss=0.3091, pruned_loss=0.08564, over 21490.00 frames. ], tot_loss[loss=0.2251, simple_loss=0.3078, pruned_loss=0.07118, over 4278633.75 frames. ], batch size: 389, lr: 3.32e-03, grad_scale: 32.0 2023-06-26 07:22:26,103 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=3.43 vs. limit=12.0 2023-06-26 07:23:06,654 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.432e+02 5.072e+02 7.125e+02 1.070e+03 2.650e+03, threshold=1.425e+03, percent-clipped=11.0 2023-06-26 07:23:34,381 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1519482.0, ans=0.1 2023-06-26 07:23:52,146 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1519542.0, ans=0.0 2023-06-26 07:23:53,074 INFO [train.py:996] (3/4) Epoch 9, batch 9300, loss[loss=0.2052, simple_loss=0.2717, pruned_loss=0.06932, over 21111.00 frames. ], tot_loss[loss=0.2219, simple_loss=0.3026, pruned_loss=0.0706, over 4273697.19 frames. ], batch size: 176, lr: 3.32e-03, grad_scale: 16.0 2023-06-26 07:24:14,621 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.23 vs. limit=22.5 2023-06-26 07:24:17,453 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1519602.0, ans=0.0 2023-06-26 07:24:56,390 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1519662.0, ans=0.0 2023-06-26 07:25:01,527 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=1519722.0, ans=0.0 2023-06-26 07:25:10,537 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1519722.0, ans=0.125 2023-06-26 07:25:43,675 INFO [train.py:996] (3/4) Epoch 9, batch 9350, loss[loss=0.2496, simple_loss=0.3315, pruned_loss=0.08382, over 21536.00 frames. ], tot_loss[loss=0.2247, simple_loss=0.3076, pruned_loss=0.07092, over 4274357.34 frames. 
], batch size: 194, lr: 3.32e-03, grad_scale: 16.0 2023-06-26 07:26:22,864 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=1519902.0, ans=0.0 2023-06-26 07:26:46,674 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1519962.0, ans=0.1 2023-06-26 07:26:46,702 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1519962.0, ans=0.0 2023-06-26 07:26:54,935 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.694e+02 5.143e+02 7.806e+02 1.433e+03 2.856e+03, threshold=1.561e+03, percent-clipped=26.0 2023-06-26 07:27:02,182 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=1520022.0, ans=0.5 2023-06-26 07:27:02,876 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=4.53 vs. limit=15.0 2023-06-26 07:27:03,816 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1520022.0, ans=0.125 2023-06-26 07:27:12,684 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=1520022.0, ans=0.125 2023-06-26 07:27:32,578 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=1520082.0, ans=0.0 2023-06-26 07:27:38,710 INFO [train.py:996] (3/4) Epoch 9, batch 9400, loss[loss=0.2044, simple_loss=0.2705, pruned_loss=0.06912, over 21592.00 frames. ], tot_loss[loss=0.2255, simple_loss=0.3077, pruned_loss=0.07163, over 4273979.70 frames. ], batch size: 298, lr: 3.32e-03, grad_scale: 16.0 2023-06-26 07:27:47,533 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=14.25 vs. limit=22.5 2023-06-26 07:28:08,642 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=1520202.0, ans=0.125 2023-06-26 07:28:55,553 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1520322.0, ans=0.125 2023-06-26 07:29:31,958 INFO [train.py:996] (3/4) Epoch 9, batch 9450, loss[loss=0.2004, simple_loss=0.264, pruned_loss=0.06842, over 21579.00 frames. ], tot_loss[loss=0.2198, simple_loss=0.2988, pruned_loss=0.07041, over 4277860.58 frames. ], batch size: 415, lr: 3.32e-03, grad_scale: 16.0 2023-06-26 07:29:53,779 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1520502.0, ans=0.0 2023-06-26 07:30:31,555 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.480e+02 5.776e+02 8.947e+02 1.514e+03 4.644e+03, threshold=1.789e+03, percent-clipped=22.0 2023-06-26 07:31:21,061 INFO [train.py:996] (3/4) Epoch 9, batch 9500, loss[loss=0.1751, simple_loss=0.2547, pruned_loss=0.0478, over 21667.00 frames. ], tot_loss[loss=0.215, simple_loss=0.2929, pruned_loss=0.06853, over 4265950.81 frames. 
], batch size: 247, lr: 3.32e-03, grad_scale: 16.0 2023-06-26 07:31:46,598 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1520802.0, ans=0.1 2023-06-26 07:32:04,781 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=1520862.0, ans=0.125 2023-06-26 07:33:12,880 INFO [train.py:996] (3/4) Epoch 9, batch 9550, loss[loss=0.2233, simple_loss=0.3179, pruned_loss=0.0643, over 19773.00 frames. ], tot_loss[loss=0.2196, simple_loss=0.2969, pruned_loss=0.07111, over 4267971.59 frames. ], batch size: 703, lr: 3.32e-03, grad_scale: 16.0 2023-06-26 07:33:41,864 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys.whitening_limit, batch_count=1521102.0, ans=6.0 2023-06-26 07:33:44,595 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1521102.0, ans=0.0 2023-06-26 07:33:47,956 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=1521102.0, ans=0.07 2023-06-26 07:33:51,817 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.74 vs. limit=22.5 2023-06-26 07:33:53,392 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=1521162.0, ans=0.015 2023-06-26 07:33:53,451 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-26 07:34:06,725 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1521162.0, ans=0.0 2023-06-26 07:34:11,605 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.380e+02 4.672e+02 5.675e+02 8.285e+02 1.544e+03, threshold=1.135e+03, percent-clipped=0.0 2023-06-26 07:35:01,318 INFO [train.py:996] (3/4) Epoch 9, batch 9600, loss[loss=0.2095, simple_loss=0.2834, pruned_loss=0.06779, over 21905.00 frames. ], tot_loss[loss=0.2217, simple_loss=0.2985, pruned_loss=0.07243, over 4275538.28 frames. ], batch size: 107, lr: 3.32e-03, grad_scale: 32.0 2023-06-26 07:35:15,860 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=1521342.0, ans=0.2 2023-06-26 07:36:52,869 INFO [train.py:996] (3/4) Epoch 9, batch 9650, loss[loss=0.2536, simple_loss=0.333, pruned_loss=0.08706, over 21515.00 frames. ], tot_loss[loss=0.2213, simple_loss=0.2986, pruned_loss=0.07202, over 4280197.92 frames. ], batch size: 131, lr: 3.32e-03, grad_scale: 16.0 2023-06-26 07:37:49,601 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.333e+02 4.623e+02 6.972e+02 1.187e+03 2.800e+03, threshold=1.394e+03, percent-clipped=26.0 2023-06-26 07:38:00,493 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=1521822.0, ans=0.125 2023-06-26 07:38:38,186 INFO [train.py:996] (3/4) Epoch 9, batch 9700, loss[loss=0.2151, simple_loss=0.2978, pruned_loss=0.06621, over 21672.00 frames. ], tot_loss[loss=0.2223, simple_loss=0.3005, pruned_loss=0.07207, over 4284120.96 frames. ], batch size: 389, lr: 3.32e-03, grad_scale: 16.0 2023-06-26 07:39:58,253 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.08 vs. 
limit=15.0 2023-06-26 07:40:27,224 INFO [train.py:996] (3/4) Epoch 9, batch 9750, loss[loss=0.2653, simple_loss=0.3623, pruned_loss=0.08416, over 21826.00 frames. ], tot_loss[loss=0.218, simple_loss=0.2942, pruned_loss=0.07089, over 4291152.03 frames. ], batch size: 118, lr: 3.32e-03, grad_scale: 16.0 2023-06-26 07:40:39,308 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=1522242.0, ans=0.0 2023-06-26 07:41:12,140 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-26 07:41:22,045 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.554e+02 4.804e+02 6.885e+02 8.968e+02 2.424e+03, threshold=1.377e+03, percent-clipped=5.0 2023-06-26 07:41:45,693 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1522482.0, ans=0.125 2023-06-26 07:42:07,396 INFO [train.py:996] (3/4) Epoch 9, batch 9800, loss[loss=0.2157, simple_loss=0.2888, pruned_loss=0.07127, over 21910.00 frames. ], tot_loss[loss=0.2185, simple_loss=0.2953, pruned_loss=0.07086, over 4287797.87 frames. ], batch size: 351, lr: 3.32e-03, grad_scale: 16.0 2023-06-26 07:42:24,351 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.60 vs. limit=6.0 2023-06-26 07:42:36,450 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1522602.0, ans=0.125 2023-06-26 07:42:44,998 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1522602.0, ans=0.1 2023-06-26 07:42:45,018 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=1522602.0, ans=0.2 2023-06-26 07:42:59,129 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=1522662.0, ans=0.125 2023-06-26 07:42:59,167 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1522662.0, ans=0.0 2023-06-26 07:43:04,707 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=1522662.0, ans=0.05 2023-06-26 07:43:45,229 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1522782.0, ans=0.0 2023-06-26 07:43:46,787 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1522782.0, ans=0.125 2023-06-26 07:43:52,293 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1522782.0, ans=0.125 2023-06-26 07:43:57,269 INFO [train.py:996] (3/4) Epoch 9, batch 9850, loss[loss=0.2021, simple_loss=0.2726, pruned_loss=0.06575, over 21788.00 frames. ], tot_loss[loss=0.2166, simple_loss=0.2923, pruned_loss=0.07042, over 4276658.01 frames. 
], batch size: 333, lr: 3.32e-03, grad_scale: 16.0 2023-06-26 07:44:24,153 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=1522902.0, ans=0.0 2023-06-26 07:44:33,084 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1522902.0, ans=0.1 2023-06-26 07:44:33,156 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1522902.0, ans=0.1 2023-06-26 07:44:37,149 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=1522902.0, ans=0.125 2023-06-26 07:44:44,520 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=7.27 vs. limit=10.0 2023-06-26 07:44:45,540 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer_ff3.min_abs, batch_count=1522962.0, ans=0.2 2023-06-26 07:44:58,516 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.446e+02 4.827e+02 6.671e+02 1.006e+03 2.121e+03, threshold=1.334e+03, percent-clipped=9.0 2023-06-26 07:45:28,154 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=1523082.0, ans=0.07 2023-06-26 07:45:52,815 INFO [train.py:996] (3/4) Epoch 9, batch 9900, loss[loss=0.263, simple_loss=0.3319, pruned_loss=0.09702, over 21418.00 frames. ], tot_loss[loss=0.2151, simple_loss=0.2902, pruned_loss=0.06999, over 4272417.29 frames. ], batch size: 159, lr: 3.32e-03, grad_scale: 16.0 2023-06-26 07:46:24,575 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1523202.0, ans=0.1 2023-06-26 07:46:27,745 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1523202.0, ans=0.1 2023-06-26 07:46:31,962 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.11 vs. limit=10.0 2023-06-26 07:46:47,820 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=8.99 vs. limit=15.0 2023-06-26 07:46:50,953 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-26 07:47:01,894 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.56 vs. limit=22.5 2023-06-26 07:47:35,633 INFO [train.py:996] (3/4) Epoch 9, batch 9950, loss[loss=0.2241, simple_loss=0.2979, pruned_loss=0.07515, over 21826.00 frames. ], tot_loss[loss=0.2171, simple_loss=0.2915, pruned_loss=0.07137, over 4273699.47 frames. 
], batch size: 118, lr: 3.32e-03, grad_scale: 16.0 2023-06-26 07:48:06,405 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1523502.0, ans=0.125 2023-06-26 07:48:13,397 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=1523502.0, ans=0.125 2023-06-26 07:48:15,382 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1523502.0, ans=0.125 2023-06-26 07:48:38,011 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.372e+02 4.966e+02 6.562e+02 9.646e+02 1.795e+03, threshold=1.312e+03, percent-clipped=7.0 2023-06-26 07:49:09,251 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=1523682.0, ans=0.2 2023-06-26 07:49:11,152 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1523682.0, ans=0.125 2023-06-26 07:49:16,851 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1523682.0, ans=0.125 2023-06-26 07:49:31,812 INFO [train.py:996] (3/4) Epoch 9, batch 10000, loss[loss=0.1841, simple_loss=0.2569, pruned_loss=0.05567, over 21768.00 frames. ], tot_loss[loss=0.2134, simple_loss=0.2865, pruned_loss=0.07019, over 4271001.02 frames. ], batch size: 282, lr: 3.32e-03, grad_scale: 32.0 2023-06-26 07:50:11,255 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=8.65 vs. limit=15.0 2023-06-26 07:51:22,407 INFO [train.py:996] (3/4) Epoch 9, batch 10050, loss[loss=0.18, simple_loss=0.2562, pruned_loss=0.05191, over 21424.00 frames. ], tot_loss[loss=0.2166, simple_loss=0.2904, pruned_loss=0.07142, over 4276374.86 frames. ], batch size: 211, lr: 3.32e-03, grad_scale: 32.0 2023-06-26 07:51:25,022 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1524042.0, ans=0.125 2023-06-26 07:51:27,347 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten.whitening_limit, batch_count=1524042.0, ans=15.0 2023-06-26 07:51:30,960 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.52 vs. limit=12.0 2023-06-26 07:51:48,116 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=1524102.0, ans=0.09899494936611666 2023-06-26 07:52:01,236 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1524102.0, ans=0.125 2023-06-26 07:52:31,218 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.556e+02 5.086e+02 7.732e+02 1.194e+03 2.294e+03, threshold=1.546e+03, percent-clipped=16.0 2023-06-26 07:52:39,743 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.39 vs. 
limit=12.0 2023-06-26 07:53:05,076 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=1524282.0, ans=0.04949747468305833 2023-06-26 07:53:13,018 INFO [train.py:996] (3/4) Epoch 9, batch 10100, loss[loss=0.2327, simple_loss=0.2933, pruned_loss=0.08611, over 20264.00 frames. ], tot_loss[loss=0.2135, simple_loss=0.2884, pruned_loss=0.06926, over 4269145.28 frames. ], batch size: 707, lr: 3.32e-03, grad_scale: 32.0 2023-06-26 07:53:22,742 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.50 vs. limit=15.0 2023-06-26 07:53:55,933 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1524402.0, ans=0.1 2023-06-26 07:53:59,511 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1524462.0, ans=0.125 2023-06-26 07:54:01,217 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=1524462.0, ans=0.2 2023-06-26 07:54:15,794 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.32 vs. limit=12.0 2023-06-26 07:54:18,967 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1524462.0, ans=0.0 2023-06-26 07:54:40,429 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.33 vs. limit=15.0 2023-06-26 07:55:07,062 INFO [train.py:996] (3/4) Epoch 9, batch 10150, loss[loss=0.2068, simple_loss=0.2788, pruned_loss=0.06739, over 21657.00 frames. ], tot_loss[loss=0.2177, simple_loss=0.2927, pruned_loss=0.07134, over 4268890.60 frames. ], batch size: 247, lr: 3.32e-03, grad_scale: 16.0 2023-06-26 07:55:16,465 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1524642.0, ans=0.1 2023-06-26 07:55:28,914 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1524702.0, ans=0.0 2023-06-26 07:56:02,312 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=1524762.0, ans=10.0 2023-06-26 07:56:10,905 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.445e+02 5.423e+02 7.380e+02 1.011e+03 1.635e+03, threshold=1.476e+03, percent-clipped=1.0 2023-06-26 07:56:29,989 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=6.53 vs. limit=22.5 2023-06-26 07:56:56,530 INFO [train.py:996] (3/4) Epoch 9, batch 10200, loss[loss=0.196, simple_loss=0.2872, pruned_loss=0.05242, over 21678.00 frames. ], tot_loss[loss=0.2156, simple_loss=0.292, pruned_loss=0.06956, over 4255320.88 frames. 
], batch size: 391, lr: 3.32e-03, grad_scale: 16.0 2023-06-26 07:56:59,176 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1524942.0, ans=0.125 2023-06-26 07:57:48,258 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=1525062.0, ans=0.0 2023-06-26 07:58:15,714 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=1525122.0, ans=0.2 2023-06-26 07:58:15,818 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1525122.0, ans=0.0 2023-06-26 07:58:47,141 INFO [train.py:996] (3/4) Epoch 9, batch 10250, loss[loss=0.1566, simple_loss=0.2492, pruned_loss=0.03204, over 21555.00 frames. ], tot_loss[loss=0.2073, simple_loss=0.2871, pruned_loss=0.06376, over 4267646.80 frames. ], batch size: 230, lr: 3.32e-03, grad_scale: 16.0 2023-06-26 07:59:30,348 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=1525302.0, ans=0.125 2023-06-26 07:59:58,341 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.778e+02 4.201e+02 6.167e+02 1.103e+03 3.116e+03, threshold=1.233e+03, percent-clipped=15.0 2023-06-26 08:00:38,949 INFO [train.py:996] (3/4) Epoch 9, batch 10300, loss[loss=0.2185, simple_loss=0.32, pruned_loss=0.05846, over 21809.00 frames. ], tot_loss[loss=0.2108, simple_loss=0.2911, pruned_loss=0.06523, over 4275997.16 frames. ], batch size: 282, lr: 3.32e-03, grad_scale: 16.0 2023-06-26 08:00:40,075 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.23 vs. limit=12.0 2023-06-26 08:00:42,840 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=1525542.0, ans=0.0 2023-06-26 08:02:24,779 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1525782.0, ans=0.125 2023-06-26 08:02:30,552 INFO [train.py:996] (3/4) Epoch 9, batch 10350, loss[loss=0.2068, simple_loss=0.2856, pruned_loss=0.06403, over 21678.00 frames. ], tot_loss[loss=0.2119, simple_loss=0.2925, pruned_loss=0.0656, over 4275230.48 frames. ], batch size: 351, lr: 3.32e-03, grad_scale: 16.0 2023-06-26 08:03:04,498 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.41 vs. limit=12.0 2023-06-26 08:03:07,320 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-26 08:03:46,302 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.376e+02 5.119e+02 7.830e+02 1.250e+03 2.539e+03, threshold=1.566e+03, percent-clipped=25.0 2023-06-26 08:04:03,817 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.31 vs. limit=6.0 2023-06-26 08:04:14,564 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=1526082.0, ans=0.125 2023-06-26 08:04:33,026 INFO [train.py:996] (3/4) Epoch 9, batch 10400, loss[loss=0.2749, simple_loss=0.3405, pruned_loss=0.1046, over 21455.00 frames. ], tot_loss[loss=0.2095, simple_loss=0.2877, pruned_loss=0.06565, over 4264471.57 frames. 
], batch size: 507, lr: 3.32e-03, grad_scale: 32.0 2023-06-26 08:06:20,417 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.33 vs. limit=15.0 2023-06-26 08:06:24,909 INFO [train.py:996] (3/4) Epoch 9, batch 10450, loss[loss=0.2317, simple_loss=0.3082, pruned_loss=0.0776, over 21764.00 frames. ], tot_loss[loss=0.2139, simple_loss=0.291, pruned_loss=0.06845, over 4273721.56 frames. ], batch size: 247, lr: 3.32e-03, grad_scale: 16.0 2023-06-26 08:07:29,718 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.606e+02 5.261e+02 7.908e+02 1.020e+03 2.061e+03, threshold=1.582e+03, percent-clipped=9.0 2023-06-26 08:07:41,690 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1526622.0, ans=0.1 2023-06-26 08:08:14,060 INFO [train.py:996] (3/4) Epoch 9, batch 10500, loss[loss=0.2111, simple_loss=0.2724, pruned_loss=0.07488, over 21489.00 frames. ], tot_loss[loss=0.2124, simple_loss=0.2898, pruned_loss=0.06748, over 4274394.03 frames. ], batch size: 441, lr: 3.32e-03, grad_scale: 16.0 2023-06-26 08:09:39,147 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.40 vs. limit=15.0 2023-06-26 08:09:42,517 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer_ff3.min_abs, batch_count=1526922.0, ans=0.2 2023-06-26 08:09:44,120 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=1526982.0, ans=0.07 2023-06-26 08:10:02,797 INFO [train.py:996] (3/4) Epoch 9, batch 10550, loss[loss=0.1845, simple_loss=0.2501, pruned_loss=0.05947, over 21755.00 frames. ], tot_loss[loss=0.2092, simple_loss=0.2846, pruned_loss=0.06689, over 4262495.57 frames. ], batch size: 124, lr: 3.32e-03, grad_scale: 16.0 2023-06-26 08:10:43,337 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=6.74 vs. limit=22.5 2023-06-26 08:10:44,697 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=1527102.0, ans=0.125 2023-06-26 08:11:01,323 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.65 vs. limit=12.0 2023-06-26 08:11:07,151 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.351e+02 4.011e+02 5.575e+02 6.702e+02 2.123e+03, threshold=1.115e+03, percent-clipped=3.0 2023-06-26 08:11:45,150 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1527282.0, ans=0.0 2023-06-26 08:11:47,870 INFO [train.py:996] (3/4) Epoch 9, batch 10600, loss[loss=0.1749, simple_loss=0.2665, pruned_loss=0.04164, over 21640.00 frames. ], tot_loss[loss=0.2056, simple_loss=0.2802, pruned_loss=0.06553, over 4263195.94 frames. ], batch size: 263, lr: 3.32e-03, grad_scale: 16.0 2023-06-26 08:13:16,498 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1527522.0, ans=0.0 2023-06-26 08:13:44,631 INFO [train.py:996] (3/4) Epoch 9, batch 10650, loss[loss=0.1662, simple_loss=0.2556, pruned_loss=0.03842, over 21833.00 frames. 
], tot_loss[loss=0.2056, simple_loss=0.2824, pruned_loss=0.0644, over 4268255.80 frames. ], batch size: 317, lr: 3.31e-03, grad_scale: 16.0 2023-06-26 08:14:11,278 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=1527702.0, ans=0.05 2023-06-26 08:14:49,636 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.095e+02 4.930e+02 8.313e+02 1.262e+03 3.074e+03, threshold=1.663e+03, percent-clipped=34.0 2023-06-26 08:14:50,154 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=1527822.0, ans=0.2 2023-06-26 08:15:18,984 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1527882.0, ans=0.1 2023-06-26 08:15:34,231 INFO [train.py:996] (3/4) Epoch 9, batch 10700, loss[loss=0.2484, simple_loss=0.3236, pruned_loss=0.08658, over 21309.00 frames. ], tot_loss[loss=0.2055, simple_loss=0.2824, pruned_loss=0.06429, over 4262941.68 frames. ], batch size: 159, lr: 3.31e-03, grad_scale: 16.0 2023-06-26 08:16:02,086 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.46 vs. limit=15.0 2023-06-26 08:16:04,760 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1528002.0, ans=0.0 2023-06-26 08:16:24,708 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1528062.0, ans=0.125 2023-06-26 08:16:41,350 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1528122.0, ans=0.125 2023-06-26 08:17:17,505 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=1528182.0, ans=0.2 2023-06-26 08:17:20,277 INFO [train.py:996] (3/4) Epoch 9, batch 10750, loss[loss=0.2321, simple_loss=0.3138, pruned_loss=0.07524, over 21266.00 frames. ], tot_loss[loss=0.2139, simple_loss=0.2922, pruned_loss=0.06783, over 4259531.78 frames. ], batch size: 176, lr: 3.31e-03, grad_scale: 8.0 2023-06-26 08:17:38,671 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=1528302.0, ans=0.2 2023-06-26 08:18:02,587 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.16 vs. limit=15.0 2023-06-26 08:18:33,231 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.381e+02 4.303e+02 6.075e+02 7.797e+02 1.997e+03, threshold=1.215e+03, percent-clipped=3.0 2023-06-26 08:19:10,501 INFO [train.py:996] (3/4) Epoch 9, batch 10800, loss[loss=0.2361, simple_loss=0.3108, pruned_loss=0.08073, over 21760.00 frames. ], tot_loss[loss=0.2165, simple_loss=0.2964, pruned_loss=0.06832, over 4262682.44 frames. 
], batch size: 332, lr: 3.31e-03, grad_scale: 16.0 2023-06-26 08:19:43,496 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1528602.0, ans=0.0 2023-06-26 08:20:02,871 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=1528662.0, ans=0.09899494936611666 2023-06-26 08:20:15,611 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1528662.0, ans=0.0 2023-06-26 08:20:41,186 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1528782.0, ans=0.125 2023-06-26 08:20:46,954 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1528782.0, ans=0.125 2023-06-26 08:21:00,996 INFO [train.py:996] (3/4) Epoch 9, batch 10850, loss[loss=0.1849, simple_loss=0.2612, pruned_loss=0.05429, over 21271.00 frames. ], tot_loss[loss=0.2168, simple_loss=0.297, pruned_loss=0.06835, over 4261252.89 frames. ], batch size: 131, lr: 3.31e-03, grad_scale: 16.0 2023-06-26 08:22:19,321 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.316e+02 4.810e+02 7.791e+02 1.214e+03 2.371e+03, threshold=1.558e+03, percent-clipped=23.0 2023-06-26 08:22:28,209 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1529022.0, ans=0.1 2023-06-26 08:22:56,742 INFO [train.py:996] (3/4) Epoch 9, batch 10900, loss[loss=0.2237, simple_loss=0.3219, pruned_loss=0.06279, over 20839.00 frames. ], tot_loss[loss=0.2122, simple_loss=0.2913, pruned_loss=0.06659, over 4247344.54 frames. ], batch size: 609, lr: 3.31e-03, grad_scale: 16.0 2023-06-26 08:23:04,387 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1529142.0, ans=0.0 2023-06-26 08:23:34,312 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1529202.0, ans=0.125 2023-06-26 08:23:35,746 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1529202.0, ans=0.1 2023-06-26 08:24:05,640 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1529322.0, ans=0.125 2023-06-26 08:24:06,210 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.83 vs. limit=12.0 2023-06-26 08:24:44,096 INFO [train.py:996] (3/4) Epoch 9, batch 10950, loss[loss=0.1819, simple_loss=0.2511, pruned_loss=0.05639, over 21542.00 frames. ], tot_loss[loss=0.209, simple_loss=0.2871, pruned_loss=0.06543, over 4244377.78 frames. 
], batch size: 263, lr: 3.31e-03, grad_scale: 16.0 2023-06-26 08:25:11,878 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1529502.0, ans=0.1 2023-06-26 08:25:32,105 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1529562.0, ans=0.125 2023-06-26 08:25:45,728 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=1529562.0, ans=0.125 2023-06-26 08:25:52,651 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1529622.0, ans=0.1 2023-06-26 08:25:55,402 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.406e+02 4.859e+02 7.093e+02 1.092e+03 2.550e+03, threshold=1.419e+03, percent-clipped=10.0 2023-06-26 08:26:17,870 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.85 vs. limit=15.0 2023-06-26 08:26:25,658 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1529742.0, ans=0.0 2023-06-26 08:26:26,611 INFO [train.py:996] (3/4) Epoch 9, batch 11000, loss[loss=0.2158, simple_loss=0.2761, pruned_loss=0.07776, over 20040.00 frames. ], tot_loss[loss=0.2086, simple_loss=0.2859, pruned_loss=0.06562, over 4237816.93 frames. ], batch size: 703, lr: 3.31e-03, grad_scale: 16.0 2023-06-26 08:27:58,416 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1529982.0, ans=0.1 2023-06-26 08:28:20,347 INFO [train.py:996] (3/4) Epoch 9, batch 11050, loss[loss=0.2274, simple_loss=0.2665, pruned_loss=0.09415, over 21388.00 frames. ], tot_loss[loss=0.2088, simple_loss=0.2835, pruned_loss=0.06708, over 4233850.15 frames. ], batch size: 508, lr: 3.31e-03, grad_scale: 16.0 2023-06-26 08:28:28,153 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=1530042.0, ans=0.2 2023-06-26 08:28:31,648 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1530042.0, ans=0.125 2023-06-26 08:28:42,780 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=1530102.0, ans=0.125 2023-06-26 08:28:58,222 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=1530102.0, ans=10.0 2023-06-26 08:29:03,611 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1530102.0, ans=0.0 2023-06-26 08:29:08,715 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1530162.0, ans=0.1 2023-06-26 08:29:14,862 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.75 vs. 
limit=15.0 2023-06-26 08:29:15,885 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=1530162.0, ans=0.07 2023-06-26 08:29:27,624 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=1530222.0, ans=0.2 2023-06-26 08:29:31,418 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.57 vs. limit=15.0 2023-06-26 08:29:32,131 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.162e+02 4.865e+02 7.286e+02 1.085e+03 1.953e+03, threshold=1.457e+03, percent-clipped=8.0 2023-06-26 08:30:03,324 INFO [train.py:996] (3/4) Epoch 9, batch 11100, loss[loss=0.215, simple_loss=0.2831, pruned_loss=0.07344, over 21842.00 frames. ], tot_loss[loss=0.2078, simple_loss=0.2816, pruned_loss=0.06699, over 4245102.25 frames. ], batch size: 98, lr: 3.31e-03, grad_scale: 16.0 2023-06-26 08:31:41,576 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=1530582.0, ans=0.2 2023-06-26 08:31:57,809 INFO [train.py:996] (3/4) Epoch 9, batch 11150, loss[loss=0.2104, simple_loss=0.2981, pruned_loss=0.06134, over 21203.00 frames. ], tot_loss[loss=0.2071, simple_loss=0.2803, pruned_loss=0.06694, over 4248235.26 frames. ], batch size: 159, lr: 3.31e-03, grad_scale: 16.0 2023-06-26 08:32:00,050 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=1530642.0, ans=0.0 2023-06-26 08:32:00,158 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1530642.0, ans=0.1 2023-06-26 08:33:09,611 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.403e+02 4.594e+02 7.408e+02 1.103e+03 2.164e+03, threshold=1.482e+03, percent-clipped=12.0 2023-06-26 08:33:40,337 INFO [train.py:996] (3/4) Epoch 9, batch 11200, loss[loss=0.2044, simple_loss=0.2614, pruned_loss=0.07375, over 21769.00 frames. ], tot_loss[loss=0.2064, simple_loss=0.2795, pruned_loss=0.06659, over 4237252.64 frames. ], batch size: 112, lr: 3.31e-03, grad_scale: 32.0 2023-06-26 08:34:38,011 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1531062.0, ans=0.125 2023-06-26 08:34:56,831 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=1531122.0, ans=0.125 2023-06-26 08:35:20,251 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1531182.0, ans=0.125 2023-06-26 08:35:30,885 INFO [train.py:996] (3/4) Epoch 9, batch 11250, loss[loss=0.2036, simple_loss=0.2749, pruned_loss=0.06616, over 21367.00 frames. ], tot_loss[loss=0.2056, simple_loss=0.2784, pruned_loss=0.06643, over 4247628.77 frames. 
], batch size: 548, lr: 3.31e-03, grad_scale: 32.0 2023-06-26 08:36:02,069 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1531302.0, ans=0.1 2023-06-26 08:36:49,911 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=1531422.0, ans=0.5 2023-06-26 08:36:50,827 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.393e+02 4.914e+02 6.866e+02 9.264e+02 1.730e+03, threshold=1.373e+03, percent-clipped=7.0 2023-06-26 08:37:20,318 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.77 vs. limit=15.0 2023-06-26 08:37:20,707 INFO [train.py:996] (3/4) Epoch 9, batch 11300, loss[loss=0.2071, simple_loss=0.2812, pruned_loss=0.06645, over 21335.00 frames. ], tot_loss[loss=0.2076, simple_loss=0.2814, pruned_loss=0.06693, over 4257445.80 frames. ], batch size: 159, lr: 3.31e-03, grad_scale: 16.0 2023-06-26 08:38:23,107 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1531662.0, ans=0.0 2023-06-26 08:38:36,049 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=1531722.0, ans=0.2 2023-06-26 08:39:06,110 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer_na.min_abs, batch_count=1531782.0, ans=0.02 2023-06-26 08:39:16,333 INFO [train.py:996] (3/4) Epoch 9, batch 11350, loss[loss=0.2365, simple_loss=0.3212, pruned_loss=0.07591, over 21711.00 frames. ], tot_loss[loss=0.2078, simple_loss=0.2831, pruned_loss=0.06626, over 4260783.40 frames. ], batch size: 351, lr: 3.31e-03, grad_scale: 16.0 2023-06-26 08:39:30,374 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=1531842.0, ans=0.0 2023-06-26 08:40:31,492 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.547e+02 4.947e+02 6.813e+02 1.038e+03 3.040e+03, threshold=1.363e+03, percent-clipped=13.0 2023-06-26 08:41:08,346 INFO [train.py:996] (3/4) Epoch 9, batch 11400, loss[loss=0.173, simple_loss=0.2287, pruned_loss=0.05867, over 16719.00 frames. ], tot_loss[loss=0.2124, simple_loss=0.2886, pruned_loss=0.06808, over 4259004.10 frames. ], batch size: 61, lr: 3.31e-03, grad_scale: 16.0 2023-06-26 08:41:57,498 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=1532202.0, ans=0.035 2023-06-26 08:42:09,626 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=1532262.0, ans=0.125 2023-06-26 08:43:04,955 INFO [train.py:996] (3/4) Epoch 9, batch 11450, loss[loss=0.2094, simple_loss=0.2986, pruned_loss=0.06015, over 21522.00 frames. ], tot_loss[loss=0.2126, simple_loss=0.2902, pruned_loss=0.06752, over 4262252.16 frames. ], batch size: 471, lr: 3.31e-03, grad_scale: 16.0 2023-06-26 08:43:54,489 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.05 vs. limit=6.0 2023-06-26 08:44:04,681 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=9.95 vs. 
limit=15.0 2023-06-26 08:44:13,738 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=1532622.0, ans=0.125 2023-06-26 08:44:14,729 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.477e+02 5.112e+02 7.054e+02 1.112e+03 2.275e+03, threshold=1.411e+03, percent-clipped=15.0 2023-06-26 08:44:16,727 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=1532622.0, ans=0.0 2023-06-26 08:45:01,344 INFO [train.py:996] (3/4) Epoch 9, batch 11500, loss[loss=0.1816, simple_loss=0.2753, pruned_loss=0.04397, over 21424.00 frames. ], tot_loss[loss=0.2144, simple_loss=0.2929, pruned_loss=0.06793, over 4268355.26 frames. ], batch size: 211, lr: 3.31e-03, grad_scale: 16.0 2023-06-26 08:45:46,493 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1532862.0, ans=0.125 2023-06-26 08:45:46,532 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=1532862.0, ans=0.0 2023-06-26 08:46:30,084 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=1532982.0, ans=0.125 2023-06-26 08:46:53,159 INFO [train.py:996] (3/4) Epoch 9, batch 11550, loss[loss=0.2846, simple_loss=0.4144, pruned_loss=0.07744, over 21178.00 frames. ], tot_loss[loss=0.2185, simple_loss=0.3002, pruned_loss=0.06841, over 4270561.41 frames. ], batch size: 548, lr: 3.31e-03, grad_scale: 16.0 2023-06-26 08:47:19,125 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=1533102.0, ans=0.0 2023-06-26 08:48:08,421 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.509e+02 5.865e+02 8.299e+02 1.163e+03 3.420e+03, threshold=1.660e+03, percent-clipped=18.0 2023-06-26 08:48:31,860 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=1533282.0, ans=0.0 2023-06-26 08:48:48,927 INFO [train.py:996] (3/4) Epoch 9, batch 11600, loss[loss=0.2311, simple_loss=0.3237, pruned_loss=0.06923, over 21799.00 frames. ], tot_loss[loss=0.227, simple_loss=0.3125, pruned_loss=0.07069, over 4273522.61 frames. ], batch size: 124, lr: 3.31e-03, grad_scale: 32.0 2023-06-26 08:49:13,213 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1533402.0, ans=0.125 2023-06-26 08:49:13,287 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=1533402.0, ans=0.2 2023-06-26 08:49:41,718 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1533462.0, ans=0.0 2023-06-26 08:50:06,199 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1533522.0, ans=0.125 2023-06-26 08:50:10,832 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=1533522.0, ans=0.2 2023-06-26 08:50:14,156 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=1533582.0, ans=0.07 2023-06-26 08:50:35,651 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=12.28 vs. 
limit=22.5 2023-06-26 08:50:37,990 INFO [train.py:996] (3/4) Epoch 9, batch 11650, loss[loss=0.2032, simple_loss=0.2784, pruned_loss=0.06396, over 21837.00 frames. ], tot_loss[loss=0.2319, simple_loss=0.3195, pruned_loss=0.07214, over 4276972.49 frames. ], batch size: 107, lr: 3.31e-03, grad_scale: 16.0 2023-06-26 08:50:40,260 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1533642.0, ans=0.125 2023-06-26 08:50:50,508 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1533642.0, ans=0.125 2023-06-26 08:50:52,839 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=1533642.0, ans=0.0 2023-06-26 08:51:52,913 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.214e+02 7.495e+02 1.149e+03 1.864e+03 4.386e+03, threshold=2.298e+03, percent-clipped=28.0 2023-06-26 08:51:56,924 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1533822.0, ans=0.125 2023-06-26 08:52:19,590 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=1533882.0, ans=0.025 2023-06-26 08:52:26,015 INFO [train.py:996] (3/4) Epoch 9, batch 11700, loss[loss=0.1867, simple_loss=0.2564, pruned_loss=0.05853, over 21673.00 frames. ], tot_loss[loss=0.2262, simple_loss=0.31, pruned_loss=0.07119, over 4273924.20 frames. ], batch size: 282, lr: 3.31e-03, grad_scale: 16.0 2023-06-26 08:54:00,600 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=1534182.0, ans=0.0 2023-06-26 08:54:13,594 INFO [train.py:996] (3/4) Epoch 9, batch 11750, loss[loss=0.2002, simple_loss=0.2627, pruned_loss=0.06882, over 21627.00 frames. ], tot_loss[loss=0.2209, simple_loss=0.3, pruned_loss=0.07093, over 4272556.96 frames. ], batch size: 231, lr: 3.31e-03, grad_scale: 16.0 2023-06-26 08:54:26,421 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1534242.0, ans=0.1 2023-06-26 08:55:00,999 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1534362.0, ans=0.0 2023-06-26 08:55:31,057 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.373e+02 4.368e+02 6.221e+02 1.023e+03 2.709e+03, threshold=1.244e+03, percent-clipped=2.0 2023-06-26 08:56:03,961 INFO [train.py:996] (3/4) Epoch 9, batch 11800, loss[loss=0.2336, simple_loss=0.3032, pruned_loss=0.08195, over 21392.00 frames. ], tot_loss[loss=0.2243, simple_loss=0.3028, pruned_loss=0.07294, over 4268022.19 frames. ], batch size: 549, lr: 3.31e-03, grad_scale: 16.0 2023-06-26 08:56:28,505 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=13.02 vs. limit=22.5 2023-06-26 08:57:28,382 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=13.84 vs. limit=22.5 2023-06-26 08:57:33,130 INFO [scaling.py:962] (3/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=5.87 vs. 
limit=8.0 2023-06-26 08:57:38,174 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.87 vs. limit=12.0 2023-06-26 08:57:46,644 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1534782.0, ans=0.125 2023-06-26 08:57:53,787 INFO [train.py:996] (3/4) Epoch 9, batch 11850, loss[loss=0.2067, simple_loss=0.3067, pruned_loss=0.05335, over 21823.00 frames. ], tot_loss[loss=0.2227, simple_loss=0.3029, pruned_loss=0.07119, over 4276502.13 frames. ], batch size: 282, lr: 3.31e-03, grad_scale: 16.0 2023-06-26 08:59:16,061 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.272e+02 4.313e+02 5.764e+02 8.343e+02 1.784e+03, threshold=1.153e+03, percent-clipped=5.0 2023-06-26 08:59:16,800 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=1535022.0, ans=0.0 2023-06-26 08:59:31,509 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.15 vs. limit=15.0 2023-06-26 08:59:50,215 INFO [train.py:996] (3/4) Epoch 9, batch 11900, loss[loss=0.2541, simple_loss=0.3293, pruned_loss=0.08947, over 21369.00 frames. ], tot_loss[loss=0.221, simple_loss=0.3034, pruned_loss=0.06933, over 4281001.35 frames. ], batch size: 471, lr: 3.31e-03, grad_scale: 16.0 2023-06-26 09:00:51,156 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.19 vs. limit=12.0 2023-06-26 09:00:52,583 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1535262.0, ans=0.125 2023-06-26 09:01:00,156 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.37 vs. limit=15.0 2023-06-26 09:01:08,758 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=1535322.0, ans=0.2 2023-06-26 09:01:36,240 INFO [train.py:996] (3/4) Epoch 9, batch 11950, loss[loss=0.2247, simple_loss=0.3343, pruned_loss=0.05756, over 21201.00 frames. ], tot_loss[loss=0.2185, simple_loss=0.3037, pruned_loss=0.06662, over 4276800.09 frames. ], batch size: 548, lr: 3.31e-03, grad_scale: 16.0 2023-06-26 09:01:42,072 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=1535442.0, ans=0.2 2023-06-26 09:02:46,913 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=11.85 vs. limit=15.0 2023-06-26 09:02:50,738 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.248e+02 4.636e+02 6.640e+02 1.069e+03 2.597e+03, threshold=1.328e+03, percent-clipped=19.0 2023-06-26 09:03:20,855 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=1535682.0, ans=0.0 2023-06-26 09:03:23,556 INFO [train.py:996] (3/4) Epoch 9, batch 12000, loss[loss=0.21, simple_loss=0.2748, pruned_loss=0.07258, over 21786.00 frames. ], tot_loss[loss=0.215, simple_loss=0.2997, pruned_loss=0.06513, over 4270665.62 frames. 
], batch size: 371, lr: 3.31e-03, grad_scale: 32.0 2023-06-26 09:03:23,556 INFO [train.py:1019] (3/4) Computing validation loss 2023-06-26 09:03:41,735 INFO [train.py:1028] (3/4) Epoch 9, validation: loss=0.2638, simple_loss=0.3517, pruned_loss=0.08798, over 1796401.00 frames. 2023-06-26 09:03:41,736 INFO [train.py:1029] (3/4) Maximum memory allocated so far is 23690MB 2023-06-26 09:04:26,866 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1535802.0, ans=0.125 2023-06-26 09:04:32,639 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.82 vs. limit=10.0 2023-06-26 09:04:37,090 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1535862.0, ans=0.1 2023-06-26 09:04:54,317 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1535922.0, ans=0.1 2023-06-26 09:05:31,706 INFO [train.py:996] (3/4) Epoch 9, batch 12050, loss[loss=0.2552, simple_loss=0.3086, pruned_loss=0.1009, over 21796.00 frames. ], tot_loss[loss=0.214, simple_loss=0.2956, pruned_loss=0.06619, over 4271386.65 frames. ], batch size: 508, lr: 3.31e-03, grad_scale: 16.0 2023-06-26 09:06:30,534 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1536162.0, ans=0.1 2023-06-26 09:06:30,603 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-26 09:06:54,782 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.278e+02 4.979e+02 7.743e+02 1.300e+03 2.733e+03, threshold=1.549e+03, percent-clipped=23.0 2023-06-26 09:07:03,000 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=1536282.0, ans=0.95 2023-06-26 09:07:34,230 INFO [train.py:996] (3/4) Epoch 9, batch 12100, loss[loss=0.3072, simple_loss=0.3657, pruned_loss=0.1244, over 21372.00 frames. ], tot_loss[loss=0.2217, simple_loss=0.3013, pruned_loss=0.07108, over 4277122.35 frames. ], batch size: 507, lr: 3.31e-03, grad_scale: 16.0 2023-06-26 09:07:51,624 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1536402.0, ans=0.125 2023-06-26 09:08:39,485 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1536522.0, ans=0.1 2023-06-26 09:09:12,426 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-26 09:09:12,493 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=1536582.0, ans=0.0 2023-06-26 09:09:27,800 INFO [train.py:996] (3/4) Epoch 9, batch 12150, loss[loss=0.176, simple_loss=0.2236, pruned_loss=0.06416, over 20848.00 frames. ], tot_loss[loss=0.2207, simple_loss=0.3023, pruned_loss=0.06953, over 4273155.10 frames. 
], batch size: 613, lr: 3.31e-03, grad_scale: 16.0 2023-06-26 09:10:43,934 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.377e+02 5.272e+02 8.352e+02 1.536e+03 2.585e+03, threshold=1.670e+03, percent-clipped=24.0 2023-06-26 09:11:19,130 INFO [train.py:996] (3/4) Epoch 9, batch 12200, loss[loss=0.218, simple_loss=0.2754, pruned_loss=0.08026, over 21544.00 frames. ], tot_loss[loss=0.2192, simple_loss=0.2994, pruned_loss=0.06945, over 4271538.30 frames. ], batch size: 391, lr: 3.30e-03, grad_scale: 16.0 2023-06-26 09:11:25,781 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.70 vs. limit=22.5 2023-06-26 09:11:35,309 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1537002.0, ans=0.125 2023-06-26 09:12:01,431 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1537062.0, ans=0.125 2023-06-26 09:12:05,474 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.41 vs. limit=12.0 2023-06-26 09:13:05,517 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1537242.0, ans=0.0 2023-06-26 09:13:06,912 INFO [train.py:996] (3/4) Epoch 9, batch 12250, loss[loss=0.1337, simple_loss=0.2028, pruned_loss=0.03236, over 21065.00 frames. ], tot_loss[loss=0.2124, simple_loss=0.2913, pruned_loss=0.06674, over 4266262.92 frames. ], batch size: 143, lr: 3.30e-03, grad_scale: 16.0 2023-06-26 09:13:30,094 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=1537302.0, ans=0.125 2023-06-26 09:13:41,098 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=1537302.0, ans=0.125 2023-06-26 09:13:50,771 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1537362.0, ans=0.0 2023-06-26 09:13:50,818 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1537362.0, ans=0.1 2023-06-26 09:14:12,894 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.985e+02 4.210e+02 5.762e+02 8.754e+02 2.023e+03, threshold=1.152e+03, percent-clipped=2.0 2023-06-26 09:14:21,935 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=1537422.0, ans=0.2 2023-06-26 09:14:25,497 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1537482.0, ans=0.125 2023-06-26 09:14:41,542 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-26 09:14:55,440 INFO [train.py:996] (3/4) Epoch 9, batch 12300, loss[loss=0.1712, simple_loss=0.2548, pruned_loss=0.04377, over 21389.00 frames. ], tot_loss[loss=0.204, simple_loss=0.2853, pruned_loss=0.06139, over 4259143.14 frames. ], batch size: 194, lr: 3.30e-03, grad_scale: 16.0 2023-06-26 09:16:18,901 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.74 vs. 
limit=10.0 2023-06-26 09:16:29,997 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1537782.0, ans=0.0 2023-06-26 09:16:42,681 INFO [train.py:996] (3/4) Epoch 9, batch 12350, loss[loss=0.231, simple_loss=0.3148, pruned_loss=0.07365, over 21814.00 frames. ], tot_loss[loss=0.2072, simple_loss=0.2902, pruned_loss=0.06212, over 4268948.90 frames. ], batch size: 332, lr: 3.30e-03, grad_scale: 16.0 2023-06-26 09:16:44,923 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=1537842.0, ans=0.0 2023-06-26 09:17:47,949 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.420e+02 5.624e+02 9.354e+02 1.463e+03 3.322e+03, threshold=1.871e+03, percent-clipped=32.0 2023-06-26 09:18:29,185 INFO [train.py:996] (3/4) Epoch 9, batch 12400, loss[loss=0.2398, simple_loss=0.3005, pruned_loss=0.08956, over 21356.00 frames. ], tot_loss[loss=0.2122, simple_loss=0.2924, pruned_loss=0.06601, over 4279277.01 frames. ], batch size: 159, lr: 3.30e-03, grad_scale: 32.0 2023-06-26 09:18:49,450 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1538202.0, ans=0.125 2023-06-26 09:20:08,657 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=1538382.0, ans=0.125 2023-06-26 09:20:18,871 INFO [train.py:996] (3/4) Epoch 9, batch 12450, loss[loss=0.2541, simple_loss=0.3299, pruned_loss=0.0892, over 21754.00 frames. ], tot_loss[loss=0.2156, simple_loss=0.2947, pruned_loss=0.06824, over 4285301.04 frames. ], batch size: 247, lr: 3.30e-03, grad_scale: 32.0 2023-06-26 09:20:49,021 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=1538502.0, ans=0.0 2023-06-26 09:21:43,226 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.605e+02 6.014e+02 7.920e+02 1.251e+03 2.737e+03, threshold=1.584e+03, percent-clipped=3.0 2023-06-26 09:21:53,942 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.91 vs. limit=22.5 2023-06-26 09:22:15,990 INFO [train.py:996] (3/4) Epoch 9, batch 12500, loss[loss=0.2998, simple_loss=0.3797, pruned_loss=0.1099, over 21496.00 frames. ], tot_loss[loss=0.2252, simple_loss=0.3058, pruned_loss=0.07228, over 4287140.34 frames. ], batch size: 471, lr: 3.30e-03, grad_scale: 32.0 2023-06-26 09:23:00,978 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=1538862.0, ans=0.125 2023-06-26 09:23:55,457 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1538982.0, ans=0.1 2023-06-26 09:24:07,307 INFO [train.py:996] (3/4) Epoch 9, batch 12550, loss[loss=0.2505, simple_loss=0.3296, pruned_loss=0.08572, over 21752.00 frames. ], tot_loss[loss=0.2296, simple_loss=0.3105, pruned_loss=0.07434, over 4283332.76 frames. 
], batch size: 441, lr: 3.30e-03, grad_scale: 16.0 2023-06-26 09:24:09,646 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1539042.0, ans=0.125 2023-06-26 09:24:25,130 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=1539042.0, ans=0.125 2023-06-26 09:25:32,843 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.614e+02 5.506e+02 7.478e+02 1.164e+03 2.448e+03, threshold=1.496e+03, percent-clipped=9.0 2023-06-26 09:25:38,493 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1539282.0, ans=0.0 2023-06-26 09:26:00,186 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1539282.0, ans=0.125 2023-06-26 09:26:01,732 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=1539342.0, ans=0.125 2023-06-26 09:26:02,742 INFO [train.py:996] (3/4) Epoch 9, batch 12600, loss[loss=0.1968, simple_loss=0.2878, pruned_loss=0.05286, over 21585.00 frames. ], tot_loss[loss=0.2257, simple_loss=0.3082, pruned_loss=0.07159, over 4279869.39 frames. ], batch size: 230, lr: 3.30e-03, grad_scale: 16.0 2023-06-26 09:26:24,806 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=1539402.0, ans=0.125 2023-06-26 09:27:47,981 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=1539582.0, ans=0.035 2023-06-26 09:27:50,961 INFO [train.py:996] (3/4) Epoch 9, batch 12650, loss[loss=0.1907, simple_loss=0.2504, pruned_loss=0.0655, over 20231.00 frames. ], tot_loss[loss=0.2186, simple_loss=0.3012, pruned_loss=0.068, over 4278908.21 frames. ], batch size: 703, lr: 3.30e-03, grad_scale: 16.0 2023-06-26 09:28:01,701 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=1539642.0, ans=0.2 2023-06-26 09:29:07,051 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.14 vs. limit=15.0 2023-06-26 09:29:07,199 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.22 vs. limit=6.0 2023-06-26 09:29:09,318 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.312e+02 4.812e+02 9.064e+02 1.405e+03 2.946e+03, threshold=1.813e+03, percent-clipped=21.0 2023-06-26 09:29:44,738 INFO [train.py:996] (3/4) Epoch 9, batch 12700, loss[loss=0.2322, simple_loss=0.302, pruned_loss=0.08124, over 21422.00 frames. ], tot_loss[loss=0.2198, simple_loss=0.3001, pruned_loss=0.06974, over 4286957.42 frames. 
], batch size: 548, lr: 3.30e-03, grad_scale: 16.0 2023-06-26 09:30:05,830 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1540002.0, ans=0.125 2023-06-26 09:30:20,204 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1540002.0, ans=0.125 2023-06-26 09:30:29,335 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=1540062.0, ans=0.05 2023-06-26 09:31:19,507 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=9.62 vs. limit=15.0 2023-06-26 09:31:21,091 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=1540182.0, ans=0.0 2023-06-26 09:31:32,355 INFO [train.py:996] (3/4) Epoch 9, batch 12750, loss[loss=0.2078, simple_loss=0.2995, pruned_loss=0.05802, over 21707.00 frames. ], tot_loss[loss=0.2209, simple_loss=0.3015, pruned_loss=0.07012, over 4291989.18 frames. ], batch size: 351, lr: 3.30e-03, grad_scale: 16.0 2023-06-26 09:31:47,207 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1540242.0, ans=0.0 2023-06-26 09:32:02,380 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1540302.0, ans=0.1 2023-06-26 09:32:45,552 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.578e+02 5.161e+02 7.205e+02 9.772e+02 1.736e+03, threshold=1.441e+03, percent-clipped=0.0 2023-06-26 09:32:47,893 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1540422.0, ans=0.125 2023-06-26 09:33:00,972 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1540482.0, ans=0.125 2023-06-26 09:33:19,507 INFO [train.py:996] (3/4) Epoch 9, batch 12800, loss[loss=0.2343, simple_loss=0.3077, pruned_loss=0.08039, over 21784.00 frames. ], tot_loss[loss=0.22, simple_loss=0.2995, pruned_loss=0.07029, over 4294201.30 frames. ], batch size: 298, lr: 3.30e-03, grad_scale: 32.0 2023-06-26 09:33:26,481 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=1540542.0, ans=0.0 2023-06-26 09:33:37,209 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1540542.0, ans=0.125 2023-06-26 09:33:54,967 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1540602.0, ans=0.125 2023-06-26 09:34:05,487 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1540662.0, ans=0.1 2023-06-26 09:34:42,535 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-26 09:35:13,872 INFO [train.py:996] (3/4) Epoch 9, batch 12850, loss[loss=0.2097, simple_loss=0.2991, pruned_loss=0.06012, over 21777.00 frames. ], tot_loss[loss=0.2223, simple_loss=0.3021, pruned_loss=0.07124, over 4294526.50 frames. 
], batch size: 247, lr: 3.30e-03, grad_scale: 32.0 2023-06-26 09:35:27,007 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1540842.0, ans=0.125 2023-06-26 09:35:54,325 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1540962.0, ans=0.1 2023-06-26 09:35:56,376 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1540962.0, ans=0.125 2023-06-26 09:36:33,517 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1541022.0, ans=0.0 2023-06-26 09:36:36,357 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.020e+02 4.565e+02 5.945e+02 7.206e+02 1.665e+03, threshold=1.189e+03, percent-clipped=1.0 2023-06-26 09:36:40,659 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1541082.0, ans=0.1 2023-06-26 09:37:04,595 INFO [train.py:996] (3/4) Epoch 9, batch 12900, loss[loss=0.2382, simple_loss=0.3287, pruned_loss=0.07385, over 21632.00 frames. ], tot_loss[loss=0.2188, simple_loss=0.3002, pruned_loss=0.06877, over 4276937.74 frames. ], batch size: 442, lr: 3.30e-03, grad_scale: 16.0 2023-06-26 09:37:22,987 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1541202.0, ans=0.0 2023-06-26 09:38:55,156 INFO [train.py:996] (3/4) Epoch 9, batch 12950, loss[loss=0.2363, simple_loss=0.3217, pruned_loss=0.07548, over 21806.00 frames. ], tot_loss[loss=0.218, simple_loss=0.3006, pruned_loss=0.06768, over 4269499.52 frames. ], batch size: 118, lr: 3.30e-03, grad_scale: 16.0 2023-06-26 09:39:53,146 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.14 vs. limit=15.0 2023-06-26 09:40:15,663 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=10.62 vs. limit=15.0 2023-06-26 09:40:21,287 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.550e+02 5.457e+02 7.611e+02 1.240e+03 2.264e+03, threshold=1.522e+03, percent-clipped=25.0 2023-06-26 09:40:30,131 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1541682.0, ans=0.125 2023-06-26 09:40:43,293 INFO [train.py:996] (3/4) Epoch 9, batch 13000, loss[loss=0.1473, simple_loss=0.2207, pruned_loss=0.03692, over 21160.00 frames. ], tot_loss[loss=0.2188, simple_loss=0.3017, pruned_loss=0.06795, over 4265074.88 frames. ], batch size: 143, lr: 3.30e-03, grad_scale: 16.0 2023-06-26 09:42:08,134 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=1541922.0, ans=10.0 2023-06-26 09:42:31,803 INFO [train.py:996] (3/4) Epoch 9, batch 13050, loss[loss=0.2471, simple_loss=0.3096, pruned_loss=0.09233, over 21700.00 frames. ], tot_loss[loss=0.2138, simple_loss=0.2967, pruned_loss=0.06547, over 4268979.21 frames. 
], batch size: 473, lr: 3.30e-03, grad_scale: 16.0 2023-06-26 09:42:35,674 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=1542042.0, ans=0.2 2023-06-26 09:42:48,398 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=1542102.0, ans=0.125 2023-06-26 09:42:58,320 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.36 vs. limit=22.5 2023-06-26 09:43:58,308 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.428e+02 4.464e+02 7.205e+02 1.000e+03 2.248e+03, threshold=1.441e+03, percent-clipped=5.0 2023-06-26 09:44:21,913 INFO [train.py:996] (3/4) Epoch 9, batch 13100, loss[loss=0.2313, simple_loss=0.3079, pruned_loss=0.07732, over 21334.00 frames. ], tot_loss[loss=0.214, simple_loss=0.2973, pruned_loss=0.06532, over 4276747.03 frames. ], batch size: 159, lr: 3.30e-03, grad_scale: 16.0 2023-06-26 09:44:39,949 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1542342.0, ans=0.125 2023-06-26 09:44:55,253 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=1542402.0, ans=0.125 2023-06-26 09:45:49,589 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=1542522.0, ans=0.0 2023-06-26 09:46:02,830 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.85 vs. limit=15.0 2023-06-26 09:46:16,501 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.48 vs. limit=22.5 2023-06-26 09:46:19,160 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1542642.0, ans=0.125 2023-06-26 09:46:20,399 INFO [train.py:996] (3/4) Epoch 9, batch 13150, loss[loss=0.1877, simple_loss=0.2684, pruned_loss=0.05344, over 21869.00 frames. ], tot_loss[loss=0.2177, simple_loss=0.3002, pruned_loss=0.06764, over 4276741.10 frames. ], batch size: 317, lr: 3.30e-03, grad_scale: 16.0 2023-06-26 09:46:47,929 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1542702.0, ans=0.0 2023-06-26 09:46:49,781 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=1542702.0, ans=0.2 2023-06-26 09:46:53,941 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1542702.0, ans=0.125 2023-06-26 09:47:30,307 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=1542822.0, ans=0.0 2023-06-26 09:47:34,221 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.27 vs. 
limit=15.0 2023-06-26 09:47:43,981 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.265e+02 6.125e+02 9.524e+02 1.520e+03 3.301e+03, threshold=1.905e+03, percent-clipped=27.0 2023-06-26 09:48:09,907 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1542882.0, ans=0.125 2023-06-26 09:48:14,243 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=1542882.0, ans=0.125 2023-06-26 09:48:23,384 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1542942.0, ans=0.0 2023-06-26 09:48:24,342 INFO [train.py:996] (3/4) Epoch 9, batch 13200, loss[loss=0.2341, simple_loss=0.306, pruned_loss=0.08112, over 21311.00 frames. ], tot_loss[loss=0.2167, simple_loss=0.2979, pruned_loss=0.06775, over 4274970.63 frames. ], batch size: 176, lr: 3.30e-03, grad_scale: 32.0 2023-06-26 09:48:51,354 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1543002.0, ans=0.1 2023-06-26 09:49:03,332 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.78 vs. limit=15.0 2023-06-26 09:50:16,132 INFO [train.py:996] (3/4) Epoch 9, batch 13250, loss[loss=0.2361, simple_loss=0.3105, pruned_loss=0.08084, over 21695.00 frames. ], tot_loss[loss=0.2196, simple_loss=0.2988, pruned_loss=0.07023, over 4268667.69 frames. ], batch size: 441, lr: 3.30e-03, grad_scale: 16.0 2023-06-26 09:50:18,483 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=1543242.0, ans=0.2 2023-06-26 09:51:48,025 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.553e+02 4.712e+02 6.598e+02 9.234e+02 1.581e+03, threshold=1.320e+03, percent-clipped=0.0 2023-06-26 09:51:52,616 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1543482.0, ans=0.125 2023-06-26 09:52:13,052 INFO [train.py:996] (3/4) Epoch 9, batch 13300, loss[loss=0.2401, simple_loss=0.3179, pruned_loss=0.08112, over 21638.00 frames. ], tot_loss[loss=0.2212, simple_loss=0.3019, pruned_loss=0.07024, over 4269612.65 frames. ], batch size: 230, lr: 3.30e-03, grad_scale: 8.0 2023-06-26 09:52:25,167 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=1543542.0, ans=0.07 2023-06-26 09:53:58,289 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=1543782.0, ans=0.125 2023-06-26 09:54:02,879 INFO [train.py:996] (3/4) Epoch 9, batch 13350, loss[loss=0.2145, simple_loss=0.3443, pruned_loss=0.04234, over 19732.00 frames. ], tot_loss[loss=0.2256, simple_loss=0.3063, pruned_loss=0.07247, over 4274115.66 frames. 
], batch size: 702, lr: 3.30e-03, grad_scale: 8.0 2023-06-26 09:54:23,002 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1543902.0, ans=0.0 2023-06-26 09:55:04,993 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=1544022.0, ans=0.2 2023-06-26 09:55:27,611 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.028e+02 5.402e+02 7.933e+02 1.042e+03 2.169e+03, threshold=1.587e+03, percent-clipped=13.0 2023-06-26 09:55:29,942 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=1544082.0, ans=0.0 2023-06-26 09:55:30,052 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=1544082.0, ans=0.0 2023-06-26 09:55:30,530 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.40 vs. limit=15.0 2023-06-26 09:55:43,787 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1544082.0, ans=0.0 2023-06-26 09:55:51,780 INFO [train.py:996] (3/4) Epoch 9, batch 13400, loss[loss=0.2045, simple_loss=0.2842, pruned_loss=0.06244, over 21888.00 frames. ], tot_loss[loss=0.2285, simple_loss=0.3076, pruned_loss=0.07471, over 4279469.75 frames. ], batch size: 316, lr: 3.30e-03, grad_scale: 8.0 2023-06-26 09:55:58,095 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1544142.0, ans=0.125 2023-06-26 09:56:01,676 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1544142.0, ans=0.125 2023-06-26 09:56:31,437 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.58 vs. limit=15.0 2023-06-26 09:56:32,352 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1544262.0, ans=0.0 2023-06-26 09:56:45,840 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=2.66 vs. limit=12.0 2023-06-26 09:56:56,506 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=1544322.0, ans=0.0 2023-06-26 09:57:02,619 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.57 vs. limit=22.5 2023-06-26 09:57:16,766 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.10 vs. limit=6.0 2023-06-26 09:57:25,763 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=1544382.0, ans=0.07 2023-06-26 09:57:27,267 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1544382.0, ans=0.125 2023-06-26 09:57:39,175 INFO [train.py:996] (3/4) Epoch 9, batch 13450, loss[loss=0.2071, simple_loss=0.2818, pruned_loss=0.06614, over 21759.00 frames. ], tot_loss[loss=0.23, simple_loss=0.3076, pruned_loss=0.07622, over 4277315.78 frames. 
], batch size: 118, lr: 3.30e-03, grad_scale: 8.0 2023-06-26 09:58:32,636 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=7.24 vs. limit=10.0 2023-06-26 09:59:10,545 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.398e+02 5.052e+02 6.156e+02 8.765e+02 1.835e+03, threshold=1.231e+03, percent-clipped=4.0 2023-06-26 09:59:11,099 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1544682.0, ans=0.125 2023-06-26 09:59:30,320 INFO [train.py:996] (3/4) Epoch 9, batch 13500, loss[loss=0.2455, simple_loss=0.3225, pruned_loss=0.08426, over 21344.00 frames. ], tot_loss[loss=0.225, simple_loss=0.3008, pruned_loss=0.07463, over 4273746.25 frames. ], batch size: 549, lr: 3.30e-03, grad_scale: 8.0 2023-06-26 10:00:32,362 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=1544862.0, ans=0.125 2023-06-26 10:01:02,958 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1544982.0, ans=0.0 2023-06-26 10:01:06,212 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=1544982.0, ans=0.0 2023-06-26 10:01:11,711 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1544982.0, ans=0.125 2023-06-26 10:01:27,154 INFO [train.py:996] (3/4) Epoch 9, batch 13550, loss[loss=0.2305, simple_loss=0.3376, pruned_loss=0.06173, over 21675.00 frames. ], tot_loss[loss=0.2234, simple_loss=0.3017, pruned_loss=0.07256, over 4277658.37 frames. ], batch size: 263, lr: 3.30e-03, grad_scale: 8.0 2023-06-26 10:01:27,686 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=1545042.0, ans=0.2 2023-06-26 10:02:17,924 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.63 vs. limit=10.0 2023-06-26 10:02:27,922 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1545162.0, ans=0.0 2023-06-26 10:02:51,988 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.799e+02 5.933e+02 9.358e+02 1.476e+03 2.986e+03, threshold=1.872e+03, percent-clipped=34.0 2023-06-26 10:02:54,518 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=1545282.0, ans=0.0 2023-06-26 10:03:04,302 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=2.97 vs. limit=6.0 2023-06-26 10:03:16,848 INFO [train.py:996] (3/4) Epoch 9, batch 13600, loss[loss=0.1922, simple_loss=0.2701, pruned_loss=0.05721, over 21786.00 frames. ], tot_loss[loss=0.2238, simple_loss=0.3031, pruned_loss=0.0723, over 4277934.46 frames. ], batch size: 247, lr: 3.30e-03, grad_scale: 16.0 2023-06-26 10:04:09,594 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1545462.0, ans=0.0 2023-06-26 10:05:04,165 INFO [train.py:996] (3/4) Epoch 9, batch 13650, loss[loss=0.183, simple_loss=0.2488, pruned_loss=0.05863, over 21636.00 frames. 
], tot_loss[loss=0.218, simple_loss=0.2973, pruned_loss=0.0693, over 4281848.43 frames. ], batch size: 247, lr: 3.30e-03, grad_scale: 16.0 2023-06-26 10:05:24,788 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1545642.0, ans=0.1 2023-06-26 10:05:58,293 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1545762.0, ans=0.0 2023-06-26 10:06:09,345 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.73 vs. limit=22.5 2023-06-26 10:06:23,074 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.72 vs. limit=15.0 2023-06-26 10:06:23,704 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.401e+02 4.955e+02 6.723e+02 8.963e+02 2.035e+03, threshold=1.345e+03, percent-clipped=1.0 2023-06-26 10:06:48,927 INFO [train.py:996] (3/4) Epoch 9, batch 13700, loss[loss=0.1833, simple_loss=0.2473, pruned_loss=0.05967, over 21248.00 frames. ], tot_loss[loss=0.2144, simple_loss=0.2926, pruned_loss=0.06811, over 4271460.03 frames. ], batch size: 176, lr: 3.30e-03, grad_scale: 16.0 2023-06-26 10:07:13,133 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=1546002.0, ans=0.2 2023-06-26 10:07:41,771 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=1546062.0, ans=0.07 2023-06-26 10:07:43,257 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=1546062.0, ans=0.0 2023-06-26 10:08:01,349 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.79 vs. limit=22.5 2023-06-26 10:08:15,067 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=1546182.0, ans=0.2 2023-06-26 10:08:45,477 INFO [train.py:996] (3/4) Epoch 9, batch 13750, loss[loss=0.2064, simple_loss=0.2696, pruned_loss=0.07153, over 21545.00 frames. ], tot_loss[loss=0.2133, simple_loss=0.2905, pruned_loss=0.06805, over 4264962.56 frames. ], batch size: 195, lr: 3.29e-03, grad_scale: 16.0 2023-06-26 10:08:47,910 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1546242.0, ans=0.1 2023-06-26 10:08:49,668 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=1546242.0, ans=0.0 2023-06-26 10:08:50,235 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.39 vs. 
limit=15.0 2023-06-26 10:09:44,400 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=1546422.0, ans=0.0 2023-06-26 10:10:08,767 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=1546422.0, ans=0.125 2023-06-26 10:10:16,977 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.609e+02 6.154e+02 1.114e+03 1.508e+03 3.073e+03, threshold=2.228e+03, percent-clipped=34.0 2023-06-26 10:10:41,606 INFO [train.py:996] (3/4) Epoch 9, batch 13800, loss[loss=0.2302, simple_loss=0.3331, pruned_loss=0.06368, over 21779.00 frames. ], tot_loss[loss=0.2166, simple_loss=0.2966, pruned_loss=0.06825, over 4260286.13 frames. ], batch size: 282, lr: 3.29e-03, grad_scale: 16.0 2023-06-26 10:11:01,078 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1546602.0, ans=0.125 2023-06-26 10:11:19,163 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=14.04 vs. limit=22.5 2023-06-26 10:12:01,883 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1546722.0, ans=0.1 2023-06-26 10:12:32,867 INFO [train.py:996] (3/4) Epoch 9, batch 13850, loss[loss=0.2854, simple_loss=0.366, pruned_loss=0.1024, over 21733.00 frames. ], tot_loss[loss=0.2211, simple_loss=0.3031, pruned_loss=0.06955, over 4265206.25 frames. ], batch size: 441, lr: 3.29e-03, grad_scale: 16.0 2023-06-26 10:13:53,293 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.19 vs. limit=15.0 2023-06-26 10:13:57,545 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.808e+02 5.512e+02 9.000e+02 1.173e+03 2.021e+03, threshold=1.800e+03, percent-clipped=1.0 2023-06-26 10:14:12,211 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1547082.0, ans=0.1 2023-06-26 10:14:13,750 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=1547082.0, ans=0.125 2023-06-26 10:14:22,455 INFO [train.py:996] (3/4) Epoch 9, batch 13900, loss[loss=0.2327, simple_loss=0.3025, pruned_loss=0.08139, over 21829.00 frames. ], tot_loss[loss=0.226, simple_loss=0.3065, pruned_loss=0.07273, over 4264266.88 frames. ], batch size: 298, lr: 3.29e-03, grad_scale: 16.0 2023-06-26 10:15:09,193 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1547262.0, ans=0.0 2023-06-26 10:15:10,621 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1547262.0, ans=0.1 2023-06-26 10:15:11,276 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.08 vs. limit=12.0 2023-06-26 10:15:26,730 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1547262.0, ans=0.125 2023-06-26 10:16:11,147 INFO [train.py:996] (3/4) Epoch 9, batch 13950, loss[loss=0.2229, simple_loss=0.3023, pruned_loss=0.07182, over 21775.00 frames. ], tot_loss[loss=0.2266, simple_loss=0.3059, pruned_loss=0.07366, over 4270691.22 frames. 
], batch size: 298, lr: 3.29e-03, grad_scale: 16.0 2023-06-26 10:16:36,482 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1547502.0, ans=0.125 2023-06-26 10:16:53,993 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1547562.0, ans=0.125 2023-06-26 10:17:09,632 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=1547562.0, ans=0.125 2023-06-26 10:17:33,982 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=1547622.0, ans=0.5 2023-06-26 10:17:34,993 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.697e+02 5.601e+02 7.890e+02 1.100e+03 2.147e+03, threshold=1.578e+03, percent-clipped=2.0 2023-06-26 10:17:58,870 INFO [train.py:996] (3/4) Epoch 9, batch 14000, loss[loss=0.1715, simple_loss=0.245, pruned_loss=0.04905, over 21489.00 frames. ], tot_loss[loss=0.223, simple_loss=0.3029, pruned_loss=0.07153, over 4269176.19 frames. ], batch size: 195, lr: 3.29e-03, grad_scale: 32.0 2023-06-26 10:18:01,305 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1547742.0, ans=0.125 2023-06-26 10:18:15,012 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1547802.0, ans=0.125 2023-06-26 10:19:05,443 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-26 10:19:19,082 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=1547922.0, ans=0.2 2023-06-26 10:19:22,608 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=1547982.0, ans=0.2 2023-06-26 10:19:46,297 INFO [train.py:996] (3/4) Epoch 9, batch 14050, loss[loss=0.188, simple_loss=0.2535, pruned_loss=0.06123, over 21563.00 frames. ], tot_loss[loss=0.2161, simple_loss=0.2968, pruned_loss=0.06772, over 4275689.34 frames. ], batch size: 230, lr: 3.29e-03, grad_scale: 32.0 2023-06-26 10:19:47,223 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1548042.0, ans=0.0 2023-06-26 10:19:51,834 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=1548042.0, ans=0.0 2023-06-26 10:20:50,466 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=1548222.0, ans=0.0 2023-06-26 10:21:06,321 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.027e+02 4.796e+02 7.490e+02 1.046e+03 2.202e+03, threshold=1.498e+03, percent-clipped=4.0 2023-06-26 10:21:30,928 INFO [train.py:996] (3/4) Epoch 9, batch 14100, loss[loss=0.2262, simple_loss=0.2926, pruned_loss=0.07988, over 21226.00 frames. ], tot_loss[loss=0.2138, simple_loss=0.2929, pruned_loss=0.06735, over 4260203.73 frames. 
], batch size: 143, lr: 3.29e-03, grad_scale: 32.0 2023-06-26 10:21:31,435 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1548342.0, ans=0.125 2023-06-26 10:21:50,252 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=1548402.0, ans=0.2 2023-06-26 10:22:09,740 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-26 10:22:46,549 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1548522.0, ans=0.0 2023-06-26 10:23:18,192 INFO [train.py:996] (3/4) Epoch 9, batch 14150, loss[loss=0.2344, simple_loss=0.3188, pruned_loss=0.07501, over 21769.00 frames. ], tot_loss[loss=0.2173, simple_loss=0.2973, pruned_loss=0.06866, over 4242365.36 frames. ], batch size: 112, lr: 3.29e-03, grad_scale: 16.0 2023-06-26 10:23:28,819 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=1548642.0, ans=0.0 2023-06-26 10:23:31,237 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=4.46 vs. limit=15.0 2023-06-26 10:23:35,840 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1548702.0, ans=0.0 2023-06-26 10:23:42,589 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1548702.0, ans=0.125 2023-06-26 10:24:06,770 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=1548762.0, ans=0.0 2023-06-26 10:24:42,645 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.188e+02 5.825e+02 9.276e+02 1.325e+03 2.479e+03, threshold=1.855e+03, percent-clipped=15.0 2023-06-26 10:24:48,334 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1548882.0, ans=0.125 2023-06-26 10:24:53,285 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1548882.0, ans=0.1 2023-06-26 10:24:59,276 INFO [train.py:996] (3/4) Epoch 9, batch 14200, loss[loss=0.2041, simple_loss=0.2668, pruned_loss=0.07076, over 21461.00 frames. ], tot_loss[loss=0.2155, simple_loss=0.296, pruned_loss=0.06749, over 4252201.08 frames. ], batch size: 548, lr: 3.29e-03, grad_scale: 16.0 2023-06-26 10:26:34,764 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1549182.0, ans=0.125 2023-06-26 10:26:46,128 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.53 vs. limit=22.5 2023-06-26 10:26:47,078 INFO [train.py:996] (3/4) Epoch 9, batch 14250, loss[loss=0.1841, simple_loss=0.2554, pruned_loss=0.05637, over 21661.00 frames. ], tot_loss[loss=0.2132, simple_loss=0.2907, pruned_loss=0.0678, over 4256343.21 frames. 
], batch size: 282, lr: 3.29e-03, grad_scale: 16.0 2023-06-26 10:27:03,863 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1549242.0, ans=0.125 2023-06-26 10:27:21,146 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1549302.0, ans=0.1 2023-06-26 10:28:11,385 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten.whitening_limit, batch_count=1549422.0, ans=15.0 2023-06-26 10:28:11,385 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=11.05 vs. limit=15.0 2023-06-26 10:28:22,815 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.193e+02 4.868e+02 6.668e+02 9.362e+02 2.470e+03, threshold=1.334e+03, percent-clipped=6.0 2023-06-26 10:28:32,170 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1549482.0, ans=0.125 2023-06-26 10:28:43,655 INFO [train.py:996] (3/4) Epoch 9, batch 14300, loss[loss=0.3004, simple_loss=0.4127, pruned_loss=0.09406, over 21247.00 frames. ], tot_loss[loss=0.2161, simple_loss=0.2956, pruned_loss=0.06831, over 4262786.85 frames. ], batch size: 549, lr: 3.29e-03, grad_scale: 8.0 2023-06-26 10:28:55,581 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-26 10:29:46,915 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1549662.0, ans=0.1 2023-06-26 10:30:04,181 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1549722.0, ans=0.0 2023-06-26 10:30:08,337 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=1549722.0, ans=0.035 2023-06-26 10:30:08,859 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=2.79 vs. limit=15.0 2023-06-26 10:30:18,443 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=1549782.0, ans=0.0 2023-06-26 10:30:33,271 INFO [train.py:996] (3/4) Epoch 9, batch 14350, loss[loss=0.19, simple_loss=0.2629, pruned_loss=0.05852, over 21396.00 frames. ], tot_loss[loss=0.2215, simple_loss=0.3031, pruned_loss=0.06994, over 4266561.90 frames. ], batch size: 159, lr: 3.29e-03, grad_scale: 8.0 2023-06-26 10:30:42,788 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=1549842.0, ans=0.0 2023-06-26 10:31:01,903 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=8.59 vs. limit=15.0 2023-06-26 10:32:00,435 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.420e+02 5.713e+02 8.636e+02 1.390e+03 3.076e+03, threshold=1.727e+03, percent-clipped=28.0 2023-06-26 10:32:00,981 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=1550082.0, ans=0.125 2023-06-26 10:32:21,228 INFO [train.py:996] (3/4) Epoch 9, batch 14400, loss[loss=0.1972, simple_loss=0.2623, pruned_loss=0.06603, over 21680.00 frames. 
], tot_loss[loss=0.2197, simple_loss=0.2995, pruned_loss=0.06995, over 4263983.65 frames. ], batch size: 282, lr: 3.29e-03, grad_scale: 16.0 2023-06-26 10:32:37,039 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1550202.0, ans=0.125 2023-06-26 10:32:43,858 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1550202.0, ans=0.1 2023-06-26 10:32:59,322 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1550262.0, ans=0.0 2023-06-26 10:33:33,335 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=8.15 vs. limit=12.0 2023-06-26 10:33:39,594 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1550322.0, ans=0.125 2023-06-26 10:34:03,153 INFO [train.py:996] (3/4) Epoch 9, batch 14450, loss[loss=0.219, simple_loss=0.2858, pruned_loss=0.07615, over 20716.00 frames. ], tot_loss[loss=0.217, simple_loss=0.2934, pruned_loss=0.07028, over 4265830.26 frames. ], batch size: 609, lr: 3.29e-03, grad_scale: 16.0 2023-06-26 10:34:17,047 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=1550442.0, ans=0.0 2023-06-26 10:35:31,987 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=1550622.0, ans=0.04949747468305833 2023-06-26 10:35:36,881 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.305e+02 4.619e+02 5.727e+02 8.380e+02 1.480e+03, threshold=1.145e+03, percent-clipped=0.0 2023-06-26 10:35:43,959 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=1550682.0, ans=0.2 2023-06-26 10:35:56,844 INFO [train.py:996] (3/4) Epoch 9, batch 14500, loss[loss=0.2022, simple_loss=0.2816, pruned_loss=0.06143, over 21801.00 frames. ], tot_loss[loss=0.2148, simple_loss=0.29, pruned_loss=0.0698, over 4267412.87 frames. ], batch size: 118, lr: 3.29e-03, grad_scale: 16.0 2023-06-26 10:36:11,176 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-26 10:37:00,576 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.min_positive, batch_count=1550862.0, ans=0.05 2023-06-26 10:37:13,339 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1550922.0, ans=0.125 2023-06-26 10:37:22,964 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten.whitening_limit, batch_count=1550982.0, ans=15.0 2023-06-26 10:37:46,731 INFO [train.py:996] (3/4) Epoch 9, batch 14550, loss[loss=0.2574, simple_loss=0.3399, pruned_loss=0.08745, over 21565.00 frames. ], tot_loss[loss=0.2169, simple_loss=0.2931, pruned_loss=0.07031, over 4260927.01 frames. 
], batch size: 414, lr: 3.29e-03, grad_scale: 16.0 2023-06-26 10:38:39,589 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=1551162.0, ans=0.2 2023-06-26 10:39:20,549 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.787e+02 5.550e+02 7.546e+02 1.212e+03 2.573e+03, threshold=1.509e+03, percent-clipped=29.0 2023-06-26 10:39:29,650 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-26 10:39:35,747 INFO [train.py:996] (3/4) Epoch 9, batch 14600, loss[loss=0.2445, simple_loss=0.3239, pruned_loss=0.08253, over 21877.00 frames. ], tot_loss[loss=0.2224, simple_loss=0.2991, pruned_loss=0.07281, over 4265274.37 frames. ], batch size: 371, lr: 3.29e-03, grad_scale: 16.0 2023-06-26 10:39:43,522 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=1551342.0, ans=0.2 2023-06-26 10:40:35,487 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1551462.0, ans=0.125 2023-06-26 10:41:24,130 INFO [train.py:996] (3/4) Epoch 9, batch 14650, loss[loss=0.2319, simple_loss=0.3192, pruned_loss=0.07232, over 21680.00 frames. ], tot_loss[loss=0.2227, simple_loss=0.3015, pruned_loss=0.07191, over 4272829.06 frames. ], batch size: 441, lr: 3.29e-03, grad_scale: 16.0 2023-06-26 10:41:35,037 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=1551642.0, ans=0.05 2023-06-26 10:42:27,971 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1551822.0, ans=0.0 2023-06-26 10:42:41,591 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1551822.0, ans=0.0 2023-06-26 10:42:46,534 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.116e+02 4.456e+02 7.843e+02 1.118e+03 1.924e+03, threshold=1.569e+03, percent-clipped=10.0 2023-06-26 10:43:04,909 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=1551882.0, ans=0.125 2023-06-26 10:43:07,388 INFO [train.py:996] (3/4) Epoch 9, batch 14700, loss[loss=0.1849, simple_loss=0.267, pruned_loss=0.05136, over 21323.00 frames. ], tot_loss[loss=0.2136, simple_loss=0.2944, pruned_loss=0.06647, over 4275393.79 frames. 
], batch size: 131, lr: 3.29e-03, grad_scale: 16.0 2023-06-26 10:43:13,233 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1551942.0, ans=0.1 2023-06-26 10:43:18,944 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-26 10:43:34,469 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=1552002.0, ans=0.2 2023-06-26 10:44:19,571 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1552122.0, ans=0.125 2023-06-26 10:44:19,595 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=1552122.0, ans=0.07 2023-06-26 10:44:25,221 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=1552122.0, ans=0.04949747468305833 2023-06-26 10:44:58,835 INFO [train.py:996] (3/4) Epoch 9, batch 14750, loss[loss=0.2585, simple_loss=0.3456, pruned_loss=0.08565, over 21597.00 frames. ], tot_loss[loss=0.2182, simple_loss=0.2982, pruned_loss=0.06911, over 4266054.46 frames. ], batch size: 263, lr: 3.29e-03, grad_scale: 16.0 2023-06-26 10:45:14,648 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=1552242.0, ans=0.2 2023-06-26 10:45:14,692 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1552242.0, ans=0.0 2023-06-26 10:45:16,416 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=1552242.0, ans=0.125 2023-06-26 10:45:38,726 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=1552302.0, ans=0.125 2023-06-26 10:45:41,946 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1552302.0, ans=0.1 2023-06-26 10:46:34,190 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.559e+02 5.782e+02 7.997e+02 1.225e+03 2.854e+03, threshold=1.599e+03, percent-clipped=14.0 2023-06-26 10:46:44,213 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=1552482.0, ans=0.2 2023-06-26 10:46:55,532 INFO [train.py:996] (3/4) Epoch 9, batch 14800, loss[loss=0.2223, simple_loss=0.2956, pruned_loss=0.07448, over 21613.00 frames. ], tot_loss[loss=0.2284, simple_loss=0.309, pruned_loss=0.07388, over 4270002.32 frames. ], batch size: 298, lr: 3.29e-03, grad_scale: 32.0 2023-06-26 10:47:16,461 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2.whitening_limit, batch_count=1552542.0, ans=15.0 2023-06-26 10:48:10,123 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1552722.0, ans=0.1 2023-06-26 10:48:16,281 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.61 vs. limit=10.0 2023-06-26 10:48:59,072 INFO [train.py:996] (3/4) Epoch 9, batch 14850, loss[loss=0.2065, simple_loss=0.2705, pruned_loss=0.07129, over 21247.00 frames. 
], tot_loss[loss=0.2244, simple_loss=0.3025, pruned_loss=0.0732, over 4256440.46 frames. ], batch size: 176, lr: 3.29e-03, grad_scale: 16.0 2023-06-26 10:49:02,923 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1552842.0, ans=0.125 2023-06-26 10:50:11,306 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=1553022.0, ans=0.04949747468305833 2023-06-26 10:50:35,326 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.505e+02 5.144e+02 7.174e+02 1.026e+03 2.687e+03, threshold=1.435e+03, percent-clipped=5.0 2023-06-26 10:50:50,334 INFO [train.py:996] (3/4) Epoch 9, batch 14900, loss[loss=0.2449, simple_loss=0.3239, pruned_loss=0.08292, over 21941.00 frames. ], tot_loss[loss=0.227, simple_loss=0.3047, pruned_loss=0.07466, over 4260176.87 frames. ], batch size: 372, lr: 3.29e-03, grad_scale: 16.0 2023-06-26 10:51:13,074 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.53 vs. limit=15.0 2023-06-26 10:51:32,162 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=1553262.0, ans=0.07 2023-06-26 10:51:32,633 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.03 vs. limit=10.0 2023-06-26 10:51:48,218 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1553262.0, ans=0.0 2023-06-26 10:52:01,658 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1553322.0, ans=0.125 2023-06-26 10:52:46,143 INFO [train.py:996] (3/4) Epoch 9, batch 14950, loss[loss=0.2341, simple_loss=0.3144, pruned_loss=0.0769, over 21213.00 frames. ], tot_loss[loss=0.2266, simple_loss=0.3052, pruned_loss=0.07395, over 4271621.99 frames. ], batch size: 176, lr: 3.29e-03, grad_scale: 16.0 2023-06-26 10:53:02,417 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=1553502.0, ans=0.0 2023-06-26 10:54:17,653 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.700e+02 5.284e+02 7.127e+02 1.003e+03 2.591e+03, threshold=1.425e+03, percent-clipped=12.0 2023-06-26 10:54:37,172 INFO [train.py:996] (3/4) Epoch 9, batch 15000, loss[loss=0.2226, simple_loss=0.2861, pruned_loss=0.07955, over 21485.00 frames. ], tot_loss[loss=0.2311, simple_loss=0.3093, pruned_loss=0.07645, over 4265750.37 frames. ], batch size: 548, lr: 3.29e-03, grad_scale: 16.0 2023-06-26 10:54:37,173 INFO [train.py:1019] (3/4) Computing validation loss 2023-06-26 10:54:55,450 INFO [train.py:1028] (3/4) Epoch 9, validation: loss=0.2558, simple_loss=0.3464, pruned_loss=0.08259, over 1796401.00 frames. 2023-06-26 10:54:55,451 INFO [train.py:1029] (3/4) Maximum memory allocated so far is 23690MB 2023-06-26 10:55:24,135 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=1553802.0, ans=0.0 2023-06-26 10:55:25,778 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=1553802.0, ans=0.0 2023-06-26 10:55:55,087 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.39 vs. 
limit=12.0 2023-06-26 10:56:02,737 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.31 vs. limit=22.5 2023-06-26 10:56:12,531 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=1553922.0, ans=0.125 2023-06-26 10:56:24,213 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1553922.0, ans=0.1 2023-06-26 10:56:42,656 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.83 vs. limit=22.5 2023-06-26 10:56:46,882 INFO [train.py:996] (3/4) Epoch 9, batch 15050, loss[loss=0.1895, simple_loss=0.25, pruned_loss=0.06446, over 21814.00 frames. ], tot_loss[loss=0.2308, simple_loss=0.3081, pruned_loss=0.07672, over 4260607.02 frames. ], batch size: 102, lr: 3.29e-03, grad_scale: 16.0 2023-06-26 10:57:34,473 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1554102.0, ans=0.125 2023-06-26 10:58:21,869 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.687e+02 6.658e+02 1.222e+03 1.555e+03 2.780e+03, threshold=2.443e+03, percent-clipped=32.0 2023-06-26 10:58:27,732 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1554282.0, ans=0.1 2023-06-26 10:58:40,322 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-26 10:58:41,242 INFO [train.py:996] (3/4) Epoch 9, batch 15100, loss[loss=0.2576, simple_loss=0.3312, pruned_loss=0.09202, over 21287.00 frames. ], tot_loss[loss=0.2313, simple_loss=0.3101, pruned_loss=0.07619, over 4267881.26 frames. ], batch size: 143, lr: 3.29e-03, grad_scale: 16.0 2023-06-26 10:58:42,413 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.23 vs. limit=6.0 2023-06-26 10:59:37,245 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.24 vs. limit=15.0 2023-06-26 10:59:52,568 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1554522.0, ans=0.125 2023-06-26 11:00:14,997 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=6.11 vs. limit=12.0 2023-06-26 11:00:29,582 INFO [train.py:996] (3/4) Epoch 9, batch 15150, loss[loss=0.1983, simple_loss=0.2588, pruned_loss=0.06891, over 21202.00 frames. ], tot_loss[loss=0.2279, simple_loss=0.3052, pruned_loss=0.07527, over 4269126.50 frames. 
], batch size: 159, lr: 3.29e-03, grad_scale: 16.0 2023-06-26 11:00:37,892 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1554642.0, ans=0.125 2023-06-26 11:01:06,598 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=1554702.0, ans=0.125 2023-06-26 11:01:31,036 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=1554762.0, ans=0.125 2023-06-26 11:02:05,231 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.559e+02 4.649e+02 7.475e+02 1.057e+03 2.217e+03, threshold=1.495e+03, percent-clipped=0.0 2023-06-26 11:02:10,154 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=1554882.0, ans=0.125 2023-06-26 11:02:19,219 INFO [train.py:996] (3/4) Epoch 9, batch 15200, loss[loss=0.1799, simple_loss=0.2427, pruned_loss=0.05853, over 22019.00 frames. ], tot_loss[loss=0.2212, simple_loss=0.2981, pruned_loss=0.07218, over 4268095.50 frames. ], batch size: 103, lr: 3.29e-03, grad_scale: 32.0 2023-06-26 11:02:59,407 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.63 vs. limit=15.0 2023-06-26 11:03:12,491 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=1555062.0, ans=0.125 2023-06-26 11:03:15,999 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1555062.0, ans=0.0 2023-06-26 11:03:37,402 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=9.05 vs. limit=15.0 2023-06-26 11:04:01,340 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.83 vs. limit=10.0 2023-06-26 11:04:12,988 INFO [train.py:996] (3/4) Epoch 9, batch 15250, loss[loss=0.2298, simple_loss=0.2994, pruned_loss=0.08004, over 21874.00 frames. ], tot_loss[loss=0.2182, simple_loss=0.2938, pruned_loss=0.07124, over 4274847.59 frames. ], batch size: 107, lr: 3.29e-03, grad_scale: 16.0 2023-06-26 11:04:15,123 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=1555242.0, ans=0.0 2023-06-26 11:04:16,757 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1555242.0, ans=0.125 2023-06-26 11:04:34,699 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=1555302.0, ans=0.0 2023-06-26 11:05:00,616 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1555362.0, ans=0.125 2023-06-26 11:05:08,206 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=1555362.0, ans=10.0 2023-06-26 11:05:44,372 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.438e+02 5.756e+02 7.926e+02 1.187e+03 2.967e+03, threshold=1.585e+03, percent-clipped=10.0 2023-06-26 11:06:02,508 INFO [train.py:996] (3/4) Epoch 9, batch 15300, loss[loss=0.2314, simple_loss=0.3031, pruned_loss=0.07986, over 21759.00 frames. 
], tot_loss[loss=0.2226, simple_loss=0.2969, pruned_loss=0.07417, over 4274626.32 frames. ], batch size: 247, lr: 3.29e-03, grad_scale: 16.0 2023-06-26 11:06:18,481 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=13.54 vs. limit=22.5 2023-06-26 11:06:28,715 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1555602.0, ans=0.125 2023-06-26 11:06:32,373 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1555602.0, ans=0.125 2023-06-26 11:06:52,019 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.96 vs. limit=6.0 2023-06-26 11:07:12,528 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=1555722.0, ans=0.2 2023-06-26 11:07:50,656 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=5.79 vs. limit=10.0 2023-06-26 11:07:52,677 INFO [train.py:996] (3/4) Epoch 9, batch 15350, loss[loss=0.2482, simple_loss=0.3168, pruned_loss=0.08981, over 21350.00 frames. ], tot_loss[loss=0.2264, simple_loss=0.3005, pruned_loss=0.07615, over 4278818.02 frames. ], batch size: 548, lr: 3.28e-03, grad_scale: 16.0 2023-06-26 11:08:17,735 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1555902.0, ans=0.1 2023-06-26 11:09:22,277 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.533e+02 5.256e+02 7.334e+02 1.092e+03 2.120e+03, threshold=1.467e+03, percent-clipped=2.0 2023-06-26 11:09:39,843 INFO [train.py:996] (3/4) Epoch 9, batch 15400, loss[loss=0.2027, simple_loss=0.2833, pruned_loss=0.06107, over 21517.00 frames. ], tot_loss[loss=0.2253, simple_loss=0.3021, pruned_loss=0.0742, over 4274452.26 frames. ], batch size: 211, lr: 3.28e-03, grad_scale: 16.0 2023-06-26 11:09:41,283 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.98 vs. limit=15.0 2023-06-26 11:10:11,430 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=1556202.0, ans=0.2 2023-06-26 11:10:24,360 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.52 vs. limit=22.5 2023-06-26 11:11:23,608 INFO [train.py:996] (3/4) Epoch 9, batch 15450, loss[loss=0.2195, simple_loss=0.2758, pruned_loss=0.08166, over 21607.00 frames. ], tot_loss[loss=0.223, simple_loss=0.299, pruned_loss=0.07351, over 4282673.89 frames. 
], batch size: 548, lr: 3.28e-03, grad_scale: 16.0 2023-06-26 11:12:44,408 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=1556622.0, ans=0.5 2023-06-26 11:13:00,667 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1556682.0, ans=0.125 2023-06-26 11:13:01,732 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.285e+02 4.674e+02 6.020e+02 7.889e+02 1.710e+03, threshold=1.204e+03, percent-clipped=2.0 2023-06-26 11:13:20,032 INFO [train.py:996] (3/4) Epoch 9, batch 15500, loss[loss=0.237, simple_loss=0.3128, pruned_loss=0.08058, over 21329.00 frames. ], tot_loss[loss=0.2241, simple_loss=0.3018, pruned_loss=0.07321, over 4272550.38 frames. ], batch size: 548, lr: 3.28e-03, grad_scale: 16.0 2023-06-26 11:14:00,400 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=1556862.0, ans=0.2 2023-06-26 11:14:36,784 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=1556922.0, ans=0.2 2023-06-26 11:15:06,333 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.20 vs. limit=12.0 2023-06-26 11:15:11,416 INFO [train.py:996] (3/4) Epoch 9, batch 15550, loss[loss=0.2316, simple_loss=0.3197, pruned_loss=0.07171, over 21267.00 frames. ], tot_loss[loss=0.2221, simple_loss=0.3024, pruned_loss=0.07088, over 4268943.76 frames. ], batch size: 548, lr: 3.28e-03, grad_scale: 16.0 2023-06-26 11:15:22,625 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=1557042.0, ans=0.2 2023-06-26 11:16:41,913 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.382e+02 5.062e+02 7.091e+02 1.054e+03 2.391e+03, threshold=1.418e+03, percent-clipped=18.0 2023-06-26 11:16:59,946 INFO [train.py:996] (3/4) Epoch 9, batch 15600, loss[loss=0.2161, simple_loss=0.2991, pruned_loss=0.0665, over 21601.00 frames. ], tot_loss[loss=0.2173, simple_loss=0.2956, pruned_loss=0.06949, over 4262908.56 frames. ], batch size: 414, lr: 3.28e-03, grad_scale: 32.0 2023-06-26 11:17:07,499 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.min_abs, batch_count=1557342.0, ans=0.5 2023-06-26 11:17:26,868 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=1557402.0, ans=0.0 2023-06-26 11:17:38,398 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.91 vs. limit=15.0 2023-06-26 11:18:05,072 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1557522.0, ans=0.125 2023-06-26 11:18:05,503 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.20 vs. limit=6.0 2023-06-26 11:18:12,865 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.15 vs. 
limit=15.0 2023-06-26 11:18:24,095 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=1557582.0, ans=0.07 2023-06-26 11:18:43,500 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=1557582.0, ans=0.0 2023-06-26 11:18:48,408 INFO [train.py:996] (3/4) Epoch 9, batch 15650, loss[loss=0.2261, simple_loss=0.2913, pruned_loss=0.08048, over 15886.00 frames. ], tot_loss[loss=0.2161, simple_loss=0.2943, pruned_loss=0.06896, over 4253437.61 frames. ], batch size: 67, lr: 3.28e-03, grad_scale: 32.0 2023-06-26 11:18:54,191 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1557642.0, ans=0.125 2023-06-26 11:20:03,939 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.33 vs. limit=6.0 2023-06-26 11:20:25,553 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.281e+02 4.437e+02 5.415e+02 7.572e+02 1.667e+03, threshold=1.083e+03, percent-clipped=3.0 2023-06-26 11:20:43,525 INFO [train.py:996] (3/4) Epoch 9, batch 15700, loss[loss=0.2076, simple_loss=0.269, pruned_loss=0.07312, over 21182.00 frames. ], tot_loss[loss=0.2133, simple_loss=0.2902, pruned_loss=0.06823, over 4251620.99 frames. ], batch size: 143, lr: 3.28e-03, grad_scale: 32.0 2023-06-26 11:21:24,160 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.40 vs. limit=15.0 2023-06-26 11:22:30,871 INFO [train.py:996] (3/4) Epoch 9, batch 15750, loss[loss=0.1848, simple_loss=0.2523, pruned_loss=0.05871, over 16370.00 frames. ], tot_loss[loss=0.2108, simple_loss=0.2857, pruned_loss=0.06791, over 4254760.50 frames. ], batch size: 66, lr: 3.28e-03, grad_scale: 32.0 2023-06-26 11:23:00,540 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1558302.0, ans=0.125 2023-06-26 11:23:01,322 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=13.40 vs. limit=22.5 2023-06-26 11:23:58,633 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=1558482.0, ans=0.05 2023-06-26 11:24:01,252 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.153e+02 4.399e+02 6.641e+02 9.028e+02 1.552e+03, threshold=1.328e+03, percent-clipped=11.0 2023-06-26 11:24:07,118 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-26 11:24:18,411 INFO [train.py:996] (3/4) Epoch 9, batch 15800, loss[loss=0.1797, simple_loss=0.2324, pruned_loss=0.06347, over 20789.00 frames. ], tot_loss[loss=0.2079, simple_loss=0.2809, pruned_loss=0.06743, over 4253889.12 frames. 
], batch size: 608, lr: 3.28e-03, grad_scale: 32.0 2023-06-26 11:24:19,206 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1558542.0, ans=0.125 2023-06-26 11:24:45,263 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1558602.0, ans=0.0 2023-06-26 11:24:52,525 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=1558662.0, ans=0.125 2023-06-26 11:25:16,723 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1558722.0, ans=0.125 2023-06-26 11:25:51,280 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1558782.0, ans=0.125 2023-06-26 11:26:06,287 INFO [train.py:996] (3/4) Epoch 9, batch 15850, loss[loss=0.2535, simple_loss=0.3161, pruned_loss=0.09548, over 21237.00 frames. ], tot_loss[loss=0.2112, simple_loss=0.2839, pruned_loss=0.06927, over 4258127.96 frames. ], batch size: 143, lr: 3.28e-03, grad_scale: 16.0 2023-06-26 11:26:30,660 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.93 vs. limit=15.0 2023-06-26 11:27:07,344 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1559022.0, ans=0.0 2023-06-26 11:27:38,954 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.367e+02 5.055e+02 6.778e+02 9.936e+02 2.216e+03, threshold=1.356e+03, percent-clipped=9.0 2023-06-26 11:27:49,526 INFO [train.py:996] (3/4) Epoch 9, batch 15900, loss[loss=0.2226, simple_loss=0.3032, pruned_loss=0.07096, over 21827.00 frames. ], tot_loss[loss=0.2095, simple_loss=0.281, pruned_loss=0.06894, over 4268447.20 frames. ], batch size: 372, lr: 3.28e-03, grad_scale: 16.0 2023-06-26 11:28:10,959 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1559202.0, ans=0.1 2023-06-26 11:28:18,038 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=1559202.0, ans=0.125 2023-06-26 11:29:38,899 INFO [train.py:996] (3/4) Epoch 9, batch 15950, loss[loss=0.1897, simple_loss=0.2316, pruned_loss=0.07385, over 20773.00 frames. ], tot_loss[loss=0.2076, simple_loss=0.2818, pruned_loss=0.06673, over 4270124.80 frames. ], batch size: 609, lr: 3.28e-03, grad_scale: 16.0 2023-06-26 11:30:26,569 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.54 vs. limit=22.5 2023-06-26 11:30:52,662 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1559622.0, ans=0.0 2023-06-26 11:31:17,723 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.330e+02 4.974e+02 7.474e+02 9.810e+02 2.700e+03, threshold=1.495e+03, percent-clipped=8.0 2023-06-26 11:31:28,096 INFO [train.py:996] (3/4) Epoch 9, batch 16000, loss[loss=0.2277, simple_loss=0.3181, pruned_loss=0.06864, over 21810.00 frames. ], tot_loss[loss=0.2077, simple_loss=0.2842, pruned_loss=0.06555, over 4261754.46 frames. 
], batch size: 351, lr: 3.28e-03, grad_scale: 32.0 2023-06-26 11:31:50,292 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.67 vs. limit=22.5 2023-06-26 11:32:16,508 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1559862.0, ans=0.125 2023-06-26 11:32:23,821 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1559862.0, ans=0.125 2023-06-26 11:32:23,848 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=1559862.0, ans=0.0 2023-06-26 11:33:04,005 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=1559982.0, ans=0.0 2023-06-26 11:33:17,714 INFO [train.py:996] (3/4) Epoch 9, batch 16050, loss[loss=0.2648, simple_loss=0.3645, pruned_loss=0.0826, over 21697.00 frames. ], tot_loss[loss=0.2077, simple_loss=0.2875, pruned_loss=0.06391, over 4261461.95 frames. ], batch size: 441, lr: 3.28e-03, grad_scale: 16.0 2023-06-26 11:33:57,388 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1560162.0, ans=0.125 2023-06-26 11:34:07,773 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=1560162.0, ans=0.5 2023-06-26 11:34:45,487 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.405e+02 5.738e+02 8.747e+02 1.434e+03 3.009e+03, threshold=1.749e+03, percent-clipped=21.0 2023-06-26 11:35:05,344 INFO [train.py:996] (3/4) Epoch 9, batch 16100, loss[loss=0.2765, simple_loss=0.3502, pruned_loss=0.1015, over 21584.00 frames. ], tot_loss[loss=0.2125, simple_loss=0.2935, pruned_loss=0.06575, over 4261660.11 frames. ], batch size: 507, lr: 3.28e-03, grad_scale: 16.0 2023-06-26 11:35:10,920 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=1560342.0, ans=0.0 2023-06-26 11:35:22,935 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1560342.0, ans=0.125 2023-06-26 11:36:30,039 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=11.98 vs. limit=15.0 2023-06-26 11:36:31,423 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=1560582.0, ans=0.2 2023-06-26 11:36:38,922 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.99 vs. limit=15.0 2023-06-26 11:36:54,137 INFO [train.py:996] (3/4) Epoch 9, batch 16150, loss[loss=0.1757, simple_loss=0.2476, pruned_loss=0.05189, over 20216.00 frames. ], tot_loss[loss=0.2133, simple_loss=0.2932, pruned_loss=0.06667, over 4260523.31 frames. 
], batch size: 703, lr: 3.28e-03, grad_scale: 16.0 2023-06-26 11:37:34,030 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1560762.0, ans=0.125 2023-06-26 11:38:05,799 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=1560822.0, ans=0.0 2023-06-26 11:38:10,217 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=6.00 vs. limit=10.0 2023-06-26 11:38:19,607 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=1560882.0, ans=0.125 2023-06-26 11:38:32,999 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.89 vs. limit=15.0 2023-06-26 11:38:33,302 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.369e+02 5.523e+02 8.339e+02 1.289e+03 2.279e+03, threshold=1.668e+03, percent-clipped=10.0 2023-06-26 11:38:45,814 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1560942.0, ans=0.125 2023-06-26 11:38:46,833 INFO [train.py:996] (3/4) Epoch 9, batch 16200, loss[loss=0.2327, simple_loss=0.3148, pruned_loss=0.07536, over 21837.00 frames. ], tot_loss[loss=0.2159, simple_loss=0.2967, pruned_loss=0.06759, over 4269427.70 frames. ], batch size: 247, lr: 3.28e-03, grad_scale: 16.0 2023-06-26 11:38:49,837 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.28 vs. limit=15.0 2023-06-26 11:39:07,626 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1561002.0, ans=0.0 2023-06-26 11:39:42,112 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=1561062.0, ans=0.04949747468305833 2023-06-26 11:39:42,763 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=4.47 vs. limit=15.0 2023-06-26 11:39:51,498 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=8.59 vs. limit=15.0 2023-06-26 11:40:01,646 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1561122.0, ans=0.125 2023-06-26 11:40:02,175 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.09 vs. limit=12.0 2023-06-26 11:40:07,320 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.15 vs. limit=15.0 2023-06-26 11:40:34,139 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=11.71 vs. limit=15.0 2023-06-26 11:40:38,344 INFO [train.py:996] (3/4) Epoch 9, batch 16250, loss[loss=0.2176, simple_loss=0.2865, pruned_loss=0.07431, over 20067.00 frames. ], tot_loss[loss=0.2177, simple_loss=0.2973, pruned_loss=0.06907, over 4267416.09 frames. 
], batch size: 702, lr: 3.28e-03, grad_scale: 16.0 2023-06-26 11:41:16,990 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.77 vs. limit=15.0 2023-06-26 11:41:28,921 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1561362.0, ans=0.0 2023-06-26 11:42:17,694 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.175e+02 4.964e+02 6.149e+02 9.832e+02 2.311e+03, threshold=1.230e+03, percent-clipped=3.0 2023-06-26 11:42:20,686 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.38 vs. limit=6.0 2023-06-26 11:42:26,959 INFO [train.py:996] (3/4) Epoch 9, batch 16300, loss[loss=0.2217, simple_loss=0.2875, pruned_loss=0.07798, over 21345.00 frames. ], tot_loss[loss=0.2114, simple_loss=0.2908, pruned_loss=0.06598, over 4267141.48 frames. ], batch size: 507, lr: 3.28e-03, grad_scale: 16.0 2023-06-26 11:44:09,279 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.39 vs. limit=15.0 2023-06-26 11:44:11,223 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1561782.0, ans=0.125 2023-06-26 11:44:16,016 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=1561842.0, ans=0.04949747468305833 2023-06-26 11:44:17,114 INFO [train.py:996] (3/4) Epoch 9, batch 16350, loss[loss=0.232, simple_loss=0.3086, pruned_loss=0.07767, over 20761.00 frames. ], tot_loss[loss=0.21, simple_loss=0.2889, pruned_loss=0.06554, over 4265482.28 frames. ], batch size: 611, lr: 3.28e-03, grad_scale: 16.0 2023-06-26 11:44:19,342 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1561842.0, ans=0.125 2023-06-26 11:44:19,347 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1561842.0, ans=0.125 2023-06-26 11:44:21,119 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1561842.0, ans=0.1 2023-06-26 11:45:56,601 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.631e+02 4.777e+02 5.847e+02 7.634e+02 1.657e+03, threshold=1.169e+03, percent-clipped=4.0 2023-06-26 11:46:05,008 INFO [train.py:996] (3/4) Epoch 9, batch 16400, loss[loss=0.2167, simple_loss=0.2889, pruned_loss=0.07226, over 21432.00 frames. ], tot_loss[loss=0.2131, simple_loss=0.2924, pruned_loss=0.06692, over 4270386.26 frames. ], batch size: 131, lr: 3.28e-03, grad_scale: 32.0 2023-06-26 11:46:20,463 INFO [scaling.py:962] (3/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=4.17 vs. limit=5.0 2023-06-26 11:47:26,356 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1562322.0, ans=0.125 2023-06-26 11:47:54,198 INFO [train.py:996] (3/4) Epoch 9, batch 16450, loss[loss=0.2275, simple_loss=0.298, pruned_loss=0.07845, over 21873.00 frames. ], tot_loss[loss=0.2144, simple_loss=0.293, pruned_loss=0.06794, over 4278990.30 frames. 
], batch size: 351, lr: 3.28e-03, grad_scale: 16.0 2023-06-26 11:47:55,212 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=1562442.0, ans=0.0 2023-06-26 11:48:44,388 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1562562.0, ans=0.125 2023-06-26 11:49:36,729 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.419e+02 4.948e+02 6.287e+02 8.709e+02 1.538e+03, threshold=1.257e+03, percent-clipped=9.0 2023-06-26 11:49:44,329 INFO [train.py:996] (3/4) Epoch 9, batch 16500, loss[loss=0.166, simple_loss=0.2217, pruned_loss=0.05513, over 21878.00 frames. ], tot_loss[loss=0.2128, simple_loss=0.2896, pruned_loss=0.06797, over 4276016.28 frames. ], batch size: 107, lr: 3.28e-03, grad_scale: 16.0 2023-06-26 11:50:12,426 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=1562802.0, ans=0.0 2023-06-26 11:50:14,197 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1562802.0, ans=0.125 2023-06-26 11:50:22,724 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1562802.0, ans=0.125 2023-06-26 11:50:46,015 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=6.98 vs. limit=22.5 2023-06-26 11:50:49,816 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=9.24 vs. limit=15.0 2023-06-26 11:50:51,478 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=11.22 vs. limit=15.0 2023-06-26 11:51:34,679 INFO [train.py:996] (3/4) Epoch 9, batch 16550, loss[loss=0.2476, simple_loss=0.3338, pruned_loss=0.08064, over 21552.00 frames. ], tot_loss[loss=0.2111, simple_loss=0.2888, pruned_loss=0.06671, over 4270310.27 frames. ], batch size: 414, lr: 3.28e-03, grad_scale: 16.0 2023-06-26 11:53:10,238 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=6.80 vs. limit=15.0 2023-06-26 11:53:24,960 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.329e+02 6.129e+02 9.986e+02 1.624e+03 3.562e+03, threshold=1.997e+03, percent-clipped=34.0 2023-06-26 11:53:31,907 INFO [train.py:996] (3/4) Epoch 9, batch 16600, loss[loss=0.2067, simple_loss=0.309, pruned_loss=0.05215, over 20872.00 frames. ], tot_loss[loss=0.2175, simple_loss=0.2968, pruned_loss=0.06913, over 4270505.93 frames. ], batch size: 608, lr: 3.28e-03, grad_scale: 16.0 2023-06-26 11:53:42,139 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1563342.0, ans=0.1 2023-06-26 11:54:26,091 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=1563462.0, ans=0.125 2023-06-26 11:54:53,792 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.21 vs. limit=22.5 2023-06-26 11:55:29,108 INFO [train.py:996] (3/4) Epoch 9, batch 16650, loss[loss=0.316, simple_loss=0.3764, pruned_loss=0.1278, over 21291.00 frames. 
], tot_loss[loss=0.2243, simple_loss=0.3062, pruned_loss=0.07119, over 4267727.48 frames. ], batch size: 507, lr: 3.28e-03, grad_scale: 16.0 2023-06-26 11:55:32,103 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.67 vs. limit=15.0 2023-06-26 11:56:24,368 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.15 vs. limit=15.0 2023-06-26 11:56:41,168 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=8.95 vs. limit=10.0 2023-06-26 11:57:21,319 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.537e+02 4.947e+02 6.891e+02 9.517e+02 1.890e+03, threshold=1.378e+03, percent-clipped=0.0 2023-06-26 11:57:33,708 INFO [train.py:996] (3/4) Epoch 9, batch 16700, loss[loss=0.2388, simple_loss=0.3201, pruned_loss=0.07875, over 20692.00 frames. ], tot_loss[loss=0.2265, simple_loss=0.3078, pruned_loss=0.07261, over 4264771.80 frames. ], batch size: 607, lr: 3.28e-03, grad_scale: 16.0 2023-06-26 11:57:40,410 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.12 vs. limit=15.0 2023-06-26 11:57:41,647 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1563942.0, ans=0.0 2023-06-26 11:59:16,647 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.25 vs. limit=12.0 2023-06-26 11:59:27,884 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1564242.0, ans=0.125 2023-06-26 11:59:29,000 INFO [train.py:996] (3/4) Epoch 9, batch 16750, loss[loss=0.2399, simple_loss=0.3191, pruned_loss=0.08032, over 21596.00 frames. ], tot_loss[loss=0.2294, simple_loss=0.3094, pruned_loss=0.07469, over 4268940.69 frames. ], batch size: 263, lr: 3.28e-03, grad_scale: 16.0 2023-06-26 11:59:31,614 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1564242.0, ans=0.0 2023-06-26 12:00:13,893 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=1564302.0, ans=0.125 2023-06-26 12:01:13,731 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.747e+02 5.518e+02 7.489e+02 1.102e+03 1.868e+03, threshold=1.498e+03, percent-clipped=9.0 2023-06-26 12:01:20,315 INFO [train.py:996] (3/4) Epoch 9, batch 16800, loss[loss=0.2174, simple_loss=0.2824, pruned_loss=0.07621, over 21445.00 frames. ], tot_loss[loss=0.2321, simple_loss=0.3151, pruned_loss=0.07457, over 4264804.42 frames. ], batch size: 211, lr: 3.28e-03, grad_scale: 32.0 2023-06-26 12:01:57,671 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.75 vs. limit=15.0 2023-06-26 12:02:49,105 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.75 vs. limit=15.0 2023-06-26 12:03:09,551 INFO [train.py:996] (3/4) Epoch 9, batch 16850, loss[loss=0.2154, simple_loss=0.2781, pruned_loss=0.07629, over 21554.00 frames. 
], tot_loss[loss=0.2305, simple_loss=0.3107, pruned_loss=0.0751, over 4265211.52 frames. ], batch size: 548, lr: 3.28e-03, grad_scale: 16.0 2023-06-26 12:03:18,092 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.71 vs. limit=10.0 2023-06-26 12:03:40,428 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=8.80 vs. limit=10.0 2023-06-26 12:04:18,533 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=1564962.0, ans=0.0 2023-06-26 12:04:21,829 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=1565022.0, ans=0.2 2023-06-26 12:04:34,568 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.08 vs. limit=12.0 2023-06-26 12:04:47,683 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1565082.0, ans=0.1 2023-06-26 12:04:52,107 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.430e+02 5.138e+02 7.609e+02 1.062e+03 2.399e+03, threshold=1.522e+03, percent-clipped=7.0 2023-06-26 12:05:02,264 INFO [train.py:996] (3/4) Epoch 9, batch 16900, loss[loss=0.2173, simple_loss=0.2705, pruned_loss=0.08208, over 20258.00 frames. ], tot_loss[loss=0.2268, simple_loss=0.3059, pruned_loss=0.07387, over 4268399.44 frames. ], batch size: 707, lr: 3.27e-03, grad_scale: 16.0 2023-06-26 12:05:19,221 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=7.56 vs. limit=15.0 2023-06-26 12:05:42,159 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1565202.0, ans=0.0 2023-06-26 12:05:42,182 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1565202.0, ans=0.125 2023-06-26 12:06:26,988 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=1565382.0, ans=0.125 2023-06-26 12:06:43,792 INFO [train.py:996] (3/4) Epoch 9, batch 16950, loss[loss=0.2001, simple_loss=0.2714, pruned_loss=0.06434, over 21811.00 frames. ], tot_loss[loss=0.2227, simple_loss=0.2995, pruned_loss=0.07289, over 4276084.36 frames. ], batch size: 298, lr: 3.27e-03, grad_scale: 16.0 2023-06-26 12:08:22,921 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1565682.0, ans=0.125 2023-06-26 12:08:27,255 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.850e+02 5.163e+02 6.810e+02 8.799e+02 2.047e+03, threshold=1.362e+03, percent-clipped=3.0 2023-06-26 12:08:32,635 INFO [train.py:996] (3/4) Epoch 9, batch 17000, loss[loss=0.2113, simple_loss=0.2806, pruned_loss=0.07103, over 21673.00 frames. ], tot_loss[loss=0.2218, simple_loss=0.2967, pruned_loss=0.07349, over 4279389.60 frames. 
], batch size: 263, lr: 3.27e-03, grad_scale: 16.0 2023-06-26 12:08:43,096 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=1565742.0, ans=0.2 2023-06-26 12:08:55,130 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1565742.0, ans=0.125 2023-06-26 12:10:29,845 INFO [train.py:996] (3/4) Epoch 9, batch 17050, loss[loss=0.2588, simple_loss=0.3397, pruned_loss=0.08899, over 21409.00 frames. ], tot_loss[loss=0.2264, simple_loss=0.3029, pruned_loss=0.075, over 4288771.96 frames. ], batch size: 548, lr: 3.27e-03, grad_scale: 16.0 2023-06-26 12:10:46,411 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1566042.0, ans=0.0 2023-06-26 12:10:51,830 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1566102.0, ans=0.0 2023-06-26 12:11:09,296 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=1566102.0, ans=0.0 2023-06-26 12:11:35,803 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1566222.0, ans=0.125 2023-06-26 12:11:43,287 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=1566222.0, ans=0.125 2023-06-26 12:11:50,044 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=1566222.0, ans=0.125 2023-06-26 12:11:53,807 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1566282.0, ans=0.0 2023-06-26 12:11:57,404 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1566282.0, ans=0.125 2023-06-26 12:12:06,987 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.631e+02 5.718e+02 8.769e+02 1.372e+03 2.605e+03, threshold=1.754e+03, percent-clipped=26.0 2023-06-26 12:12:17,828 INFO [train.py:996] (3/4) Epoch 9, batch 17100, loss[loss=0.194, simple_loss=0.2671, pruned_loss=0.06048, over 21924.00 frames. ], tot_loss[loss=0.2259, simple_loss=0.3009, pruned_loss=0.07541, over 4295747.54 frames. ], batch size: 316, lr: 3.27e-03, grad_scale: 16.0 2023-06-26 12:12:42,243 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1566402.0, ans=0.125 2023-06-26 12:13:18,391 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.80 vs. limit=15.0 2023-06-26 12:13:21,113 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=1566462.0, ans=0.125 2023-06-26 12:13:23,141 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=1566462.0, ans=0.0 2023-06-26 12:13:27,089 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=1566522.0, ans=0.0 2023-06-26 12:14:10,886 INFO [train.py:996] (3/4) Epoch 9, batch 17150, loss[loss=0.1725, simple_loss=0.2473, pruned_loss=0.04881, over 21466.00 frames. ], tot_loss[loss=0.2225, simple_loss=0.2967, pruned_loss=0.07416, over 4298117.21 frames. 
], batch size: 194, lr: 3.27e-03, grad_scale: 16.0 2023-06-26 12:14:57,966 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1566762.0, ans=0.0 2023-06-26 12:15:10,252 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1566762.0, ans=0.125 2023-06-26 12:15:52,364 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1566882.0, ans=0.125 2023-06-26 12:15:55,037 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.574e+02 4.824e+02 6.813e+02 1.101e+03 2.342e+03, threshold=1.363e+03, percent-clipped=2.0 2023-06-26 12:16:00,474 INFO [train.py:996] (3/4) Epoch 9, batch 17200, loss[loss=0.2151, simple_loss=0.293, pruned_loss=0.06858, over 21755.00 frames. ], tot_loss[loss=0.2231, simple_loss=0.298, pruned_loss=0.07413, over 4292769.37 frames. ], batch size: 298, lr: 3.27e-03, grad_scale: 32.0 2023-06-26 12:16:21,582 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.02 vs. limit=15.0 2023-06-26 12:16:29,809 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1567002.0, ans=0.1 2023-06-26 12:17:01,500 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=1567062.0, ans=0.0 2023-06-26 12:18:02,436 INFO [train.py:996] (3/4) Epoch 9, batch 17250, loss[loss=0.2478, simple_loss=0.3452, pruned_loss=0.07521, over 17335.00 frames. ], tot_loss[loss=0.2255, simple_loss=0.3005, pruned_loss=0.07525, over 4281443.69 frames. ], batch size: 60, lr: 3.27e-03, grad_scale: 16.0 2023-06-26 12:18:06,481 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1567242.0, ans=0.125 2023-06-26 12:19:11,127 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1567422.0, ans=0.125 2023-06-26 12:19:21,900 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.92 vs. limit=15.0 2023-06-26 12:19:48,791 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.788e+02 5.514e+02 7.810e+02 1.291e+03 2.321e+03, threshold=1.562e+03, percent-clipped=17.0 2023-06-26 12:19:52,278 INFO [train.py:996] (3/4) Epoch 9, batch 17300, loss[loss=0.2936, simple_loss=0.3572, pruned_loss=0.115, over 21433.00 frames. ], tot_loss[loss=0.2332, simple_loss=0.309, pruned_loss=0.07865, over 4284751.12 frames. ], batch size: 471, lr: 3.27e-03, grad_scale: 16.0 2023-06-26 12:20:07,720 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer_ff3.min_abs, batch_count=1567542.0, ans=0.2 2023-06-26 12:20:16,863 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=1567602.0, ans=0.04949747468305833 2023-06-26 12:21:09,364 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1567722.0, ans=0.125 2023-06-26 12:21:38,703 INFO [train.py:996] (3/4) Epoch 9, batch 17350, loss[loss=0.2449, simple_loss=0.3371, pruned_loss=0.07637, over 21500.00 frames. 
], tot_loss[loss=0.2325, simple_loss=0.3083, pruned_loss=0.07835, over 4278554.80 frames. ], batch size: 471, lr: 3.27e-03, grad_scale: 16.0 2023-06-26 12:21:52,191 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1567842.0, ans=0.1 2023-06-26 12:21:55,844 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=1567902.0, ans=0.035 2023-06-26 12:21:59,267 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=1567902.0, ans=0.125 2023-06-26 12:22:23,135 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.04 vs. limit=12.0 2023-06-26 12:22:51,573 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=1568022.0, ans=0.015 2023-06-26 12:23:01,936 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=1568082.0, ans=0.125 2023-06-26 12:23:15,909 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.405e+02 5.455e+02 8.630e+02 1.274e+03 2.528e+03, threshold=1.726e+03, percent-clipped=16.0 2023-06-26 12:23:19,231 INFO [train.py:996] (3/4) Epoch 9, batch 17400, loss[loss=0.1409, simple_loss=0.1917, pruned_loss=0.04508, over 16692.00 frames. ], tot_loss[loss=0.2281, simple_loss=0.3052, pruned_loss=0.0755, over 4269497.37 frames. ], batch size: 60, lr: 3.27e-03, grad_scale: 16.0 2023-06-26 12:23:24,536 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1568142.0, ans=0.0 2023-06-26 12:23:24,564 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1568142.0, ans=0.125 2023-06-26 12:23:44,366 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1568202.0, ans=0.1 2023-06-26 12:23:47,635 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=1568202.0, ans=0.125 2023-06-26 12:25:10,950 INFO [train.py:996] (3/4) Epoch 9, batch 17450, loss[loss=0.1842, simple_loss=0.2811, pruned_loss=0.04369, over 21764.00 frames. ], tot_loss[loss=0.2232, simple_loss=0.3012, pruned_loss=0.07257, over 4271026.81 frames. ], batch size: 351, lr: 3.27e-03, grad_scale: 8.0 2023-06-26 12:25:41,205 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.07 vs. limit=15.0 2023-06-26 12:26:31,769 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.26 vs. limit=22.5 2023-06-26 12:26:57,093 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.248e+02 4.686e+02 6.725e+02 1.029e+03 2.928e+03, threshold=1.345e+03, percent-clipped=7.0 2023-06-26 12:26:57,552 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=1568742.0, ans=0.125 2023-06-26 12:26:58,681 INFO [train.py:996] (3/4) Epoch 9, batch 17500, loss[loss=0.2414, simple_loss=0.3186, pruned_loss=0.08213, over 21893.00 frames. ], tot_loss[loss=0.2191, simple_loss=0.2977, pruned_loss=0.07028, over 4271382.62 frames. 
], batch size: 118, lr: 3.27e-03, grad_scale: 8.0 2023-06-26 12:27:33,763 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1568802.0, ans=0.0 2023-06-26 12:27:56,911 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.18 vs. limit=22.5 2023-06-26 12:28:10,914 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.31 vs. limit=15.0 2023-06-26 12:28:39,807 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=1569042.0, ans=0.0 2023-06-26 12:28:40,998 INFO [train.py:996] (3/4) Epoch 9, batch 17550, loss[loss=0.224, simple_loss=0.3097, pruned_loss=0.06917, over 21803.00 frames. ], tot_loss[loss=0.2184, simple_loss=0.2985, pruned_loss=0.06917, over 4275060.33 frames. ], batch size: 124, lr: 3.27e-03, grad_scale: 8.0 2023-06-26 12:29:12,804 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1569102.0, ans=0.0 2023-06-26 12:29:29,100 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=1569162.0, ans=0.125 2023-06-26 12:29:30,769 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=1569162.0, ans=0.0 2023-06-26 12:30:02,024 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=1569222.0, ans=0.2 2023-06-26 12:30:03,503 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1569222.0, ans=0.1 2023-06-26 12:30:34,070 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.446e+02 4.559e+02 6.477e+02 8.639e+02 1.603e+03, threshold=1.295e+03, percent-clipped=2.0 2023-06-26 12:30:35,780 INFO [train.py:996] (3/4) Epoch 9, batch 17600, loss[loss=0.2219, simple_loss=0.3034, pruned_loss=0.07016, over 21507.00 frames. ], tot_loss[loss=0.2208, simple_loss=0.3011, pruned_loss=0.07022, over 4279852.94 frames. ], batch size: 194, lr: 3.27e-03, grad_scale: 16.0 2023-06-26 12:30:43,459 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1569342.0, ans=0.125 2023-06-26 12:30:45,204 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1569342.0, ans=0.125 2023-06-26 12:31:28,238 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=1569462.0, ans=0.2 2023-06-26 12:32:20,109 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1569642.0, ans=0.1 2023-06-26 12:32:21,706 INFO [train.py:996] (3/4) Epoch 9, batch 17650, loss[loss=0.2101, simple_loss=0.2959, pruned_loss=0.06212, over 20813.00 frames. ], tot_loss[loss=0.2192, simple_loss=0.2986, pruned_loss=0.06992, over 4260626.15 frames. ], batch size: 609, lr: 3.27e-03, grad_scale: 16.0 2023-06-26 12:32:27,601 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.60 vs. 
limit=12.0 2023-06-26 12:32:34,337 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.76 vs. limit=12.0 2023-06-26 12:32:41,366 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1569642.0, ans=0.125 2023-06-26 12:33:15,149 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=1569762.0, ans=0.2 2023-06-26 12:33:18,242 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1569762.0, ans=0.0 2023-06-26 12:33:49,692 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=1569822.0, ans=0.125 2023-06-26 12:33:58,802 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1569882.0, ans=0.0 2023-06-26 12:34:09,281 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.770e+02 6.180e+02 8.586e+02 1.472e+03 2.723e+03, threshold=1.717e+03, percent-clipped=31.0 2023-06-26 12:34:10,903 INFO [train.py:996] (3/4) Epoch 9, batch 17700, loss[loss=0.2649, simple_loss=0.3383, pruned_loss=0.09574, over 21765.00 frames. ], tot_loss[loss=0.2155, simple_loss=0.2955, pruned_loss=0.06775, over 4264597.56 frames. ], batch size: 441, lr: 3.27e-03, grad_scale: 16.0 2023-06-26 12:34:32,768 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1570002.0, ans=0.125 2023-06-26 12:34:48,805 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=1570002.0, ans=0.04949747468305833 2023-06-26 12:35:03,021 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1570062.0, ans=0.125 2023-06-26 12:36:06,982 INFO [train.py:996] (3/4) Epoch 9, batch 17750, loss[loss=0.2417, simple_loss=0.3215, pruned_loss=0.08091, over 21763.00 frames. ], tot_loss[loss=0.2212, simple_loss=0.3015, pruned_loss=0.07044, over 4261910.27 frames. ], batch size: 332, lr: 3.27e-03, grad_scale: 16.0 2023-06-26 12:37:52,340 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=1570482.0, ans=0.0 2023-06-26 12:37:56,887 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.878e+02 5.343e+02 8.043e+02 1.136e+03 2.008e+03, threshold=1.609e+03, percent-clipped=5.0 2023-06-26 12:38:04,121 INFO [train.py:996] (3/4) Epoch 9, batch 17800, loss[loss=0.1853, simple_loss=0.2728, pruned_loss=0.04887, over 19826.00 frames. ], tot_loss[loss=0.2186, simple_loss=0.2998, pruned_loss=0.06867, over 4264212.99 frames. ], batch size: 702, lr: 3.27e-03, grad_scale: 16.0 2023-06-26 12:38:04,799 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=1570542.0, ans=0.125 2023-06-26 12:38:11,051 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.80 vs. 
limit=6.0 2023-06-26 12:38:15,571 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=1570542.0, ans=0.025 2023-06-26 12:39:04,200 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1570662.0, ans=0.125 2023-06-26 12:39:09,378 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=1570722.0, ans=0.0 2023-06-26 12:39:34,180 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1570782.0, ans=0.0 2023-06-26 12:39:55,305 INFO [train.py:996] (3/4) Epoch 9, batch 17850, loss[loss=0.2237, simple_loss=0.3042, pruned_loss=0.07162, over 21435.00 frames. ], tot_loss[loss=0.2196, simple_loss=0.3009, pruned_loss=0.0692, over 4270715.60 frames. ], batch size: 131, lr: 3.27e-03, grad_scale: 16.0 2023-06-26 12:39:57,529 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=1570842.0, ans=0.2 2023-06-26 12:40:02,452 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=1570842.0, ans=0.05 2023-06-26 12:40:19,882 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=1570902.0, ans=0.125 2023-06-26 12:40:21,699 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1570902.0, ans=0.0 2023-06-26 12:40:23,334 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1570902.0, ans=0.125 2023-06-26 12:41:14,283 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=1571022.0, ans=0.04949747468305833 2023-06-26 12:41:24,674 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=1571082.0, ans=0.0 2023-06-26 12:41:26,745 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.24 vs. limit=6.0 2023-06-26 12:41:42,265 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.429e+02 5.491e+02 8.059e+02 1.156e+03 1.916e+03, threshold=1.612e+03, percent-clipped=10.0 2023-06-26 12:41:43,901 INFO [train.py:996] (3/4) Epoch 9, batch 17900, loss[loss=0.2061, simple_loss=0.2965, pruned_loss=0.05782, over 21261.00 frames. ], tot_loss[loss=0.2236, simple_loss=0.3048, pruned_loss=0.07118, over 4275003.41 frames. ], batch size: 159, lr: 3.27e-03, grad_scale: 16.0 2023-06-26 12:41:53,713 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=1571142.0, ans=0.07 2023-06-26 12:42:37,828 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=10.53 vs. limit=15.0 2023-06-26 12:42:42,968 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.50 vs. 
limit=15.0 2023-06-26 12:42:47,941 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1571322.0, ans=0.125 2023-06-26 12:43:22,326 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.28 vs. limit=22.5 2023-06-26 12:43:37,020 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.97 vs. limit=10.0 2023-06-26 12:43:40,938 INFO [train.py:996] (3/4) Epoch 9, batch 17950, loss[loss=0.2141, simple_loss=0.3056, pruned_loss=0.06133, over 21639.00 frames. ], tot_loss[loss=0.2206, simple_loss=0.3044, pruned_loss=0.06839, over 4267624.53 frames. ], batch size: 414, lr: 3.27e-03, grad_scale: 16.0 2023-06-26 12:44:08,627 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=1571502.0, ans=0.035 2023-06-26 12:44:34,147 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.03 vs. limit=22.5 2023-06-26 12:44:56,572 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1571622.0, ans=0.125 2023-06-26 12:45:24,823 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.199e+02 4.426e+02 5.727e+02 7.254e+02 1.857e+03, threshold=1.145e+03, percent-clipped=1.0 2023-06-26 12:45:26,472 INFO [train.py:996] (3/4) Epoch 9, batch 18000, loss[loss=0.2056, simple_loss=0.2747, pruned_loss=0.06821, over 21539.00 frames. ], tot_loss[loss=0.2161, simple_loss=0.2975, pruned_loss=0.06739, over 4262710.28 frames. ], batch size: 414, lr: 3.27e-03, grad_scale: 32.0 2023-06-26 12:45:26,472 INFO [train.py:1019] (3/4) Computing validation loss 2023-06-26 12:45:46,675 INFO [train.py:1028] (3/4) Epoch 9, validation: loss=0.2587, simple_loss=0.3543, pruned_loss=0.08153, over 1796401.00 frames. 2023-06-26 12:45:46,676 INFO [train.py:1029] (3/4) Maximum memory allocated so far is 23690MB 2023-06-26 12:46:31,115 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=1571862.0, ans=0.0 2023-06-26 12:47:19,602 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=1571982.0, ans=0.125 2023-06-26 12:47:24,645 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1571982.0, ans=0.1 2023-06-26 12:47:36,552 INFO [train.py:996] (3/4) Epoch 9, batch 18050, loss[loss=0.2256, simple_loss=0.2908, pruned_loss=0.08015, over 21622.00 frames. ], tot_loss[loss=0.2123, simple_loss=0.2917, pruned_loss=0.06641, over 4266368.75 frames. ], batch size: 415, lr: 3.27e-03, grad_scale: 16.0 2023-06-26 12:47:44,868 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.32 vs. limit=12.0 2023-06-26 12:47:44,873 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=6.71 vs. 
limit=15.0 2023-06-26 12:48:49,209 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1572222.0, ans=0.125 2023-06-26 12:49:28,425 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.634e+02 5.417e+02 6.596e+02 1.071e+03 2.802e+03, threshold=1.319e+03, percent-clipped=21.0 2023-06-26 12:49:28,455 INFO [train.py:996] (3/4) Epoch 9, batch 18100, loss[loss=0.229, simple_loss=0.3005, pruned_loss=0.07876, over 21775.00 frames. ], tot_loss[loss=0.2172, simple_loss=0.2966, pruned_loss=0.06896, over 4266354.88 frames. ], batch size: 102, lr: 3.27e-03, grad_scale: 16.0 2023-06-26 12:50:15,542 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=11.21 vs. limit=15.0 2023-06-26 12:50:16,766 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=1572462.0, ans=0.04949747468305833 2023-06-26 12:50:42,978 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.96 vs. limit=22.5 2023-06-26 12:50:43,676 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=1572522.0, ans=0.125 2023-06-26 12:51:15,562 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1572582.0, ans=0.125 2023-06-26 12:51:18,347 INFO [train.py:996] (3/4) Epoch 9, batch 18150, loss[loss=0.2021, simple_loss=0.2729, pruned_loss=0.06566, over 21508.00 frames. ], tot_loss[loss=0.2185, simple_loss=0.2991, pruned_loss=0.06898, over 4273063.06 frames. ], batch size: 195, lr: 3.27e-03, grad_scale: 16.0 2023-06-26 12:51:37,808 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1572642.0, ans=0.125 2023-06-26 12:51:46,697 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1572702.0, ans=0.125 2023-06-26 12:52:17,669 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.99 vs. limit=12.0 2023-06-26 12:52:18,862 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1572762.0, ans=0.0 2023-06-26 12:52:58,960 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.75 vs. limit=15.0 2023-06-26 12:53:05,708 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.440e+02 4.565e+02 5.741e+02 8.756e+02 1.817e+03, threshold=1.148e+03, percent-clipped=4.0 2023-06-26 12:53:05,739 INFO [train.py:996] (3/4) Epoch 9, batch 18200, loss[loss=0.1795, simple_loss=0.2482, pruned_loss=0.0554, over 21582.00 frames. ], tot_loss[loss=0.2152, simple_loss=0.2929, pruned_loss=0.06873, over 4264475.69 frames. 
], batch size: 263, lr: 3.27e-03, grad_scale: 16.0 2023-06-26 12:53:12,967 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1572942.0, ans=0.125 2023-06-26 12:53:42,260 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=1573002.0, ans=0.125 2023-06-26 12:54:00,086 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1573122.0, ans=0.1 2023-06-26 12:54:10,081 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=11.89 vs. limit=15.0 2023-06-26 12:54:22,809 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1573122.0, ans=0.1 2023-06-26 12:54:47,258 INFO [train.py:996] (3/4) Epoch 9, batch 18250, loss[loss=0.2119, simple_loss=0.2804, pruned_loss=0.07168, over 21838.00 frames. ], tot_loss[loss=0.2099, simple_loss=0.286, pruned_loss=0.06688, over 4256247.63 frames. ], batch size: 351, lr: 3.27e-03, grad_scale: 16.0 2023-06-26 12:55:22,350 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=1573302.0, ans=0.125 2023-06-26 12:55:23,904 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.min_positive, batch_count=1573302.0, ans=0.05 2023-06-26 12:56:42,100 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.384e+02 4.816e+02 6.355e+02 8.859e+02 2.523e+03, threshold=1.271e+03, percent-clipped=14.0 2023-06-26 12:56:42,144 INFO [train.py:996] (3/4) Epoch 9, batch 18300, loss[loss=0.1692, simple_loss=0.247, pruned_loss=0.04567, over 21298.00 frames. ], tot_loss[loss=0.2091, simple_loss=0.2847, pruned_loss=0.06678, over 4258377.33 frames. ], batch size: 131, lr: 3.27e-03, grad_scale: 16.0 2023-06-26 12:57:03,953 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=1573602.0, ans=0.04949747468305833 2023-06-26 12:57:14,159 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1573602.0, ans=0.125 2023-06-26 12:57:26,911 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.09 vs. limit=15.0 2023-06-26 12:58:08,678 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=1573782.0, ans=0.2 2023-06-26 12:58:21,079 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1573782.0, ans=0.125 2023-06-26 12:58:25,481 INFO [train.py:996] (3/4) Epoch 9, batch 18350, loss[loss=0.1687, simple_loss=0.2498, pruned_loss=0.04378, over 15889.00 frames. ], tot_loss[loss=0.2112, simple_loss=0.2901, pruned_loss=0.06611, over 4257142.00 frames. ], batch size: 61, lr: 3.27e-03, grad_scale: 16.0 2023-06-26 12:58:29,987 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.80 vs. 
limit=15.0 2023-06-26 12:59:26,674 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=1573962.0, ans=0.2 2023-06-26 12:59:43,826 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=1574022.0, ans=0.125 2023-06-26 12:59:48,967 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1574022.0, ans=0.1 2023-06-26 13:00:14,623 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.067e+02 5.416e+02 7.058e+02 9.535e+02 2.465e+03, threshold=1.412e+03, percent-clipped=12.0 2023-06-26 13:00:14,654 INFO [train.py:996] (3/4) Epoch 9, batch 18400, loss[loss=0.2172, simple_loss=0.2948, pruned_loss=0.06983, over 21196.00 frames. ], tot_loss[loss=0.2086, simple_loss=0.2871, pruned_loss=0.06508, over 4252788.54 frames. ], batch size: 159, lr: 3.27e-03, grad_scale: 32.0 2023-06-26 13:01:03,312 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=1574262.0, ans=0.95 2023-06-26 13:02:04,278 INFO [train.py:996] (3/4) Epoch 9, batch 18450, loss[loss=0.2553, simple_loss=0.3962, pruned_loss=0.05715, over 19717.00 frames. ], tot_loss[loss=0.2054, simple_loss=0.287, pruned_loss=0.06188, over 4253090.24 frames. ], batch size: 702, lr: 3.27e-03, grad_scale: 16.0 2023-06-26 13:02:54,956 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1574562.0, ans=0.125 2023-06-26 13:03:19,360 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1574622.0, ans=0.125 2023-06-26 13:03:28,835 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=6.71 vs. limit=15.0 2023-06-26 13:03:52,203 INFO [train.py:996] (3/4) Epoch 9, batch 18500, loss[loss=0.202, simple_loss=0.2636, pruned_loss=0.07019, over 21335.00 frames. ], tot_loss[loss=0.2017, simple_loss=0.2824, pruned_loss=0.06049, over 4252918.67 frames. ], batch size: 144, lr: 3.26e-03, grad_scale: 16.0 2023-06-26 13:03:53,914 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.096e+02 4.756e+02 7.398e+02 1.037e+03 4.377e+03, threshold=1.480e+03, percent-clipped=11.0 2023-06-26 13:03:54,365 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1574742.0, ans=0.1 2023-06-26 13:03:59,705 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1574742.0, ans=0.125 2023-06-26 13:04:16,111 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.02 vs. 
limit=15.0 2023-06-26 13:04:56,759 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=1574922.0, ans=0.0 2023-06-26 13:05:05,218 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1574922.0, ans=0.0 2023-06-26 13:05:26,211 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1574982.0, ans=0.125 2023-06-26 13:05:27,929 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1574982.0, ans=0.1 2023-06-26 13:05:29,668 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=1574982.0, ans=0.04949747468305833 2023-06-26 13:05:40,072 INFO [train.py:996] (3/4) Epoch 9, batch 18550, loss[loss=0.187, simple_loss=0.2497, pruned_loss=0.06213, over 21206.00 frames. ], tot_loss[loss=0.1992, simple_loss=0.2783, pruned_loss=0.06004, over 4246688.98 frames. ], batch size: 548, lr: 3.26e-03, grad_scale: 16.0 2023-06-26 13:05:44,774 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.04 vs. limit=15.0 2023-06-26 13:05:59,555 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1575102.0, ans=0.0 2023-06-26 13:06:01,746 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1575102.0, ans=0.1 2023-06-26 13:06:03,601 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=14.63 vs. limit=15.0 2023-06-26 13:06:37,234 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.17 vs. limit=6.0 2023-06-26 13:07:28,386 INFO [train.py:996] (3/4) Epoch 9, batch 18600, loss[loss=0.1968, simple_loss=0.2736, pruned_loss=0.05999, over 21551.00 frames. ], tot_loss[loss=0.1993, simple_loss=0.2766, pruned_loss=0.06101, over 4233402.25 frames. ], batch size: 230, lr: 3.26e-03, grad_scale: 16.0 2023-06-26 13:07:30,256 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.128e+02 4.633e+02 7.387e+02 1.048e+03 1.831e+03, threshold=1.477e+03, percent-clipped=1.0 2023-06-26 13:08:07,216 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.64 vs. limit=15.0 2023-06-26 13:08:43,303 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1575522.0, ans=0.125 2023-06-26 13:09:15,092 INFO [train.py:996] (3/4) Epoch 9, batch 18650, loss[loss=0.2312, simple_loss=0.3148, pruned_loss=0.07382, over 21782.00 frames. ], tot_loss[loss=0.1996, simple_loss=0.2763, pruned_loss=0.06149, over 4240184.82 frames. 
], batch size: 391, lr: 3.26e-03, grad_scale: 16.0 2023-06-26 13:09:20,643 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=1575642.0, ans=0.5 2023-06-26 13:09:58,314 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1575762.0, ans=0.125 2023-06-26 13:10:12,489 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1575822.0, ans=0.125 2023-06-26 13:11:02,347 INFO [train.py:996] (3/4) Epoch 9, batch 18700, loss[loss=0.2437, simple_loss=0.29, pruned_loss=0.0987, over 21599.00 frames. ], tot_loss[loss=0.1999, simple_loss=0.2747, pruned_loss=0.06251, over 4246842.11 frames. ], batch size: 508, lr: 3.26e-03, grad_scale: 16.0 2023-06-26 13:11:04,040 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.187e+02 4.395e+02 5.926e+02 8.949e+02 1.374e+03, threshold=1.185e+03, percent-clipped=0.0 2023-06-26 13:11:25,521 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=1576002.0, ans=0.125 2023-06-26 13:11:36,103 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=1576062.0, ans=0.125 2023-06-26 13:11:36,724 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.35 vs. limit=15.0 2023-06-26 13:11:46,792 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=1576062.0, ans=0.015 2023-06-26 13:12:49,694 INFO [train.py:996] (3/4) Epoch 9, batch 18750, loss[loss=0.2506, simple_loss=0.3246, pruned_loss=0.08827, over 21816.00 frames. ], tot_loss[loss=0.2037, simple_loss=0.2771, pruned_loss=0.06512, over 4265656.07 frames. ], batch size: 118, lr: 3.26e-03, grad_scale: 16.0 2023-06-26 13:12:50,233 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1576242.0, ans=0.0 2023-06-26 13:13:17,780 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=1576302.0, ans=0.125 2023-06-26 13:13:36,683 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-26 13:13:38,490 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=1576362.0, ans=0.125 2023-06-26 13:13:49,740 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.27 vs. limit=15.0 2023-06-26 13:14:37,388 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1576542.0, ans=0.125 2023-06-26 13:14:38,310 INFO [train.py:996] (3/4) Epoch 9, batch 18800, loss[loss=0.1953, simple_loss=0.2904, pruned_loss=0.05016, over 21618.00 frames. ], tot_loss[loss=0.209, simple_loss=0.2843, pruned_loss=0.06684, over 4271805.48 frames. 
], batch size: 263, lr: 3.26e-03, grad_scale: 32.0 2023-06-26 13:14:40,094 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.559e+02 6.038e+02 7.723e+02 1.097e+03 3.023e+03, threshold=1.545e+03, percent-clipped=19.0 2023-06-26 13:14:53,719 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=1576542.0, ans=0.2 2023-06-26 13:15:09,320 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1576602.0, ans=0.125 2023-06-26 13:16:18,362 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.91 vs. limit=15.0 2023-06-26 13:16:27,803 INFO [train.py:996] (3/4) Epoch 9, batch 18850, loss[loss=0.2095, simple_loss=0.2793, pruned_loss=0.06987, over 21847.00 frames. ], tot_loss[loss=0.2038, simple_loss=0.282, pruned_loss=0.06284, over 4276586.73 frames. ], batch size: 107, lr: 3.26e-03, grad_scale: 32.0 2023-06-26 13:16:28,578 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=1576842.0, ans=0.125 2023-06-26 13:18:14,392 INFO [train.py:996] (3/4) Epoch 9, batch 18900, loss[loss=0.2131, simple_loss=0.2733, pruned_loss=0.07649, over 21612.00 frames. ], tot_loss[loss=0.2012, simple_loss=0.2777, pruned_loss=0.06238, over 4280921.03 frames. ], batch size: 415, lr: 3.26e-03, grad_scale: 16.0 2023-06-26 13:18:17,651 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.228e+02 4.531e+02 6.963e+02 9.490e+02 1.932e+03, threshold=1.393e+03, percent-clipped=3.0 2023-06-26 13:18:30,603 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1577202.0, ans=0.125 2023-06-26 13:18:56,967 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.90 vs. limit=15.0 2023-06-26 13:19:33,359 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1577322.0, ans=0.1 2023-06-26 13:19:48,786 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.77 vs. limit=15.0 2023-06-26 13:19:49,956 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=1577382.0, ans=0.2 2023-06-26 13:19:55,822 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.69 vs. limit=15.0 2023-06-26 13:20:03,846 INFO [train.py:996] (3/4) Epoch 9, batch 18950, loss[loss=0.1723, simple_loss=0.2297, pruned_loss=0.05744, over 21164.00 frames. ], tot_loss[loss=0.2018, simple_loss=0.2756, pruned_loss=0.06402, over 4282131.74 frames. ], batch size: 608, lr: 3.26e-03, grad_scale: 16.0 2023-06-26 13:20:33,344 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=1577502.0, ans=0.125 2023-06-26 13:20:58,323 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.51 vs. 
limit=15.0 2023-06-26 13:21:19,634 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=1577622.0, ans=0.2 2023-06-26 13:21:29,259 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.93 vs. limit=22.5 2023-06-26 13:21:37,721 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=1577682.0, ans=0.125 2023-06-26 13:21:54,021 INFO [train.py:996] (3/4) Epoch 9, batch 19000, loss[loss=0.236, simple_loss=0.3148, pruned_loss=0.07864, over 21837.00 frames. ], tot_loss[loss=0.21, simple_loss=0.2861, pruned_loss=0.06699, over 4276736.84 frames. ], batch size: 282, lr: 3.26e-03, grad_scale: 16.0 2023-06-26 13:21:58,081 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.501e+02 4.865e+02 6.670e+02 8.887e+02 1.787e+03, threshold=1.334e+03, percent-clipped=6.0 2023-06-26 13:22:09,176 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=13.02 vs. limit=22.5 2023-06-26 13:22:35,904 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=1577862.0, ans=0.0 2023-06-26 13:23:00,784 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1577922.0, ans=0.0 2023-06-26 13:23:18,541 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1577982.0, ans=0.125 2023-06-26 13:23:29,309 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1577982.0, ans=0.125 2023-06-26 13:23:37,523 INFO [train.py:996] (3/4) Epoch 9, batch 19050, loss[loss=0.2297, simple_loss=0.296, pruned_loss=0.08168, over 21334.00 frames. ], tot_loss[loss=0.2155, simple_loss=0.2913, pruned_loss=0.06988, over 4270762.96 frames. ], batch size: 176, lr: 3.26e-03, grad_scale: 16.0 2023-06-26 13:23:44,500 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=1578042.0, ans=0.2 2023-06-26 13:25:20,507 INFO [train.py:996] (3/4) Epoch 9, batch 19100, loss[loss=0.2243, simple_loss=0.282, pruned_loss=0.08334, over 21328.00 frames. ], tot_loss[loss=0.2157, simple_loss=0.29, pruned_loss=0.07071, over 4277534.68 frames. ], batch size: 471, lr: 3.26e-03, grad_scale: 16.0 2023-06-26 13:25:24,158 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.781e+02 5.304e+02 7.054e+02 1.099e+03 1.877e+03, threshold=1.411e+03, percent-clipped=10.0 2023-06-26 13:25:48,880 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1578402.0, ans=0.125 2023-06-26 13:26:10,476 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=1578462.0, ans=0.07 2023-06-26 13:26:10,499 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1578462.0, ans=0.0 2023-06-26 13:26:59,126 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.33 vs. 
limit=15.0 2023-06-26 13:27:11,383 INFO [train.py:996] (3/4) Epoch 9, batch 19150, loss[loss=0.2949, simple_loss=0.3881, pruned_loss=0.1008, over 21531.00 frames. ], tot_loss[loss=0.2174, simple_loss=0.2919, pruned_loss=0.07142, over 4276726.23 frames. ], batch size: 471, lr: 3.26e-03, grad_scale: 16.0 2023-06-26 13:28:09,788 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1578762.0, ans=0.125 2023-06-26 13:28:14,446 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1578762.0, ans=0.125 2023-06-26 13:28:19,667 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1578762.0, ans=0.125 2023-06-26 13:28:25,340 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1578822.0, ans=0.0 2023-06-26 13:28:30,531 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1578822.0, ans=0.125 2023-06-26 13:29:06,064 INFO [train.py:996] (3/4) Epoch 9, batch 19200, loss[loss=0.2167, simple_loss=0.3218, pruned_loss=0.05582, over 21601.00 frames. ], tot_loss[loss=0.2237, simple_loss=0.3025, pruned_loss=0.07249, over 4272338.60 frames. ], batch size: 230, lr: 3.26e-03, grad_scale: 32.0 2023-06-26 13:29:10,044 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.893e+02 6.153e+02 9.835e+02 1.321e+03 2.570e+03, threshold=1.967e+03, percent-clipped=19.0 2023-06-26 13:29:50,318 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=1579002.0, ans=0.125 2023-06-26 13:29:58,697 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1579062.0, ans=0.125 2023-06-26 13:30:09,183 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1579122.0, ans=0.0 2023-06-26 13:30:40,878 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=6.05 vs. limit=15.0 2023-06-26 13:30:41,905 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-26 13:30:49,829 INFO [train.py:996] (3/4) Epoch 9, batch 19250, loss[loss=0.2031, simple_loss=0.2931, pruned_loss=0.05655, over 21784.00 frames. ], tot_loss[loss=0.2189, simple_loss=0.3023, pruned_loss=0.06772, over 4270191.26 frames. 
], batch size: 414, lr: 3.26e-03, grad_scale: 32.0 2023-06-26 13:31:23,436 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1579302.0, ans=0.125 2023-06-26 13:31:51,092 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1579362.0, ans=0.125 2023-06-26 13:32:04,999 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1579422.0, ans=0.125 2023-06-26 13:32:12,092 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=1579422.0, ans=0.2 2023-06-26 13:32:22,838 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=1579482.0, ans=0.0 2023-06-26 13:32:38,023 INFO [train.py:996] (3/4) Epoch 9, batch 19300, loss[loss=0.2196, simple_loss=0.2952, pruned_loss=0.07198, over 21550.00 frames. ], tot_loss[loss=0.2175, simple_loss=0.2995, pruned_loss=0.06773, over 4276494.51 frames. ], batch size: 471, lr: 3.26e-03, grad_scale: 32.0 2023-06-26 13:32:41,544 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.842e+02 4.708e+02 6.632e+02 9.817e+02 2.132e+03, threshold=1.326e+03, percent-clipped=1.0 2023-06-26 13:33:59,374 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1579782.0, ans=0.125 2023-06-26 13:34:04,455 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=1579782.0, ans=0.125 2023-06-26 13:34:23,279 INFO [train.py:996] (3/4) Epoch 9, batch 19350, loss[loss=0.1596, simple_loss=0.237, pruned_loss=0.04113, over 21307.00 frames. ], tot_loss[loss=0.2118, simple_loss=0.2955, pruned_loss=0.06408, over 4273281.12 frames. ], batch size: 131, lr: 3.26e-03, grad_scale: 32.0 2023-06-26 13:35:00,314 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1579902.0, ans=0.1 2023-06-26 13:35:40,456 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1580022.0, ans=0.125 2023-06-26 13:36:07,557 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1580082.0, ans=0.125 2023-06-26 13:36:07,609 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=1580082.0, ans=0.025 2023-06-26 13:36:10,337 INFO [train.py:996] (3/4) Epoch 9, batch 19400, loss[loss=0.2371, simple_loss=0.3176, pruned_loss=0.07832, over 21595.00 frames. ], tot_loss[loss=0.2091, simple_loss=0.292, pruned_loss=0.06315, over 4274197.04 frames. 
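The `loss`, `simple_loss`, and `pruned_loss` numbers in these entries are consistent with a fixed weighting of an unpruned ("simple") transducer loss and a pruned transducer loss, loss ≈ 0.5 · simple_loss + pruned_loss; for the batch 19400 totals just above, 0.5 · 0.292 + 0.06315 ≈ 0.2091. The sketch below only illustrates that arithmetic; the 0.5 weight and the helper name are assumptions for illustration, not code taken from the training script.

```python
# Minimal sketch, not the training code itself: forming the logged per-batch
# "loss" from the two transducer terms, assuming a fixed 0.5 weight on the
# simple (unpruned) loss.  `combine_rnnt_losses` is a hypothetical helper.
import torch

def combine_rnnt_losses(simple_loss: torch.Tensor,
                        pruned_loss: torch.Tensor,
                        simple_loss_scale: float = 0.5) -> torch.Tensor:
    # Weighted sum of the unpruned ("simple") and pruned transducer terms.
    return simple_loss_scale * simple_loss + pruned_loss

# Values reported at the end of the Epoch 9, batch 19400 entry above:
print(combine_rnnt_losses(torch.tensor(0.292), torch.tensor(0.06315)))
# ~0.2091, consistent with the logged loss for that batch
```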
], batch size: 441, lr: 3.26e-03, grad_scale: 16.0 2023-06-26 13:36:12,822 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1580142.0, ans=0.125 2023-06-26 13:36:15,960 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.056e+02 5.043e+02 7.685e+02 1.074e+03 1.940e+03, threshold=1.537e+03, percent-clipped=16.0 2023-06-26 13:36:50,059 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=1580202.0, ans=0.125 2023-06-26 13:36:53,860 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1580202.0, ans=0.125 2023-06-26 13:37:10,637 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=1580262.0, ans=0.125 2023-06-26 13:37:53,589 INFO [train.py:996] (3/4) Epoch 9, batch 19450, loss[loss=0.2123, simple_loss=0.2684, pruned_loss=0.07815, over 21477.00 frames. ], tot_loss[loss=0.209, simple_loss=0.289, pruned_loss=0.0645, over 4284038.80 frames. ], batch size: 441, lr: 3.26e-03, grad_scale: 16.0 2023-06-26 13:38:13,635 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.02 vs. limit=22.5 2023-06-26 13:38:14,550 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-26 13:38:21,519 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1580502.0, ans=0.2 2023-06-26 13:39:01,599 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1580622.0, ans=0.0 2023-06-26 13:39:03,261 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=1580622.0, ans=0.2 2023-06-26 13:39:35,313 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1580682.0, ans=0.125 2023-06-26 13:39:37,544 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=12.13 vs. limit=15.0 2023-06-26 13:39:41,570 INFO [train.py:996] (3/4) Epoch 9, batch 19500, loss[loss=0.1686, simple_loss=0.2398, pruned_loss=0.04876, over 21400.00 frames. ], tot_loss[loss=0.208, simple_loss=0.2849, pruned_loss=0.06558, over 4279825.46 frames. ], batch size: 160, lr: 3.26e-03, grad_scale: 16.0 2023-06-26 13:39:46,899 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.457e+02 4.487e+02 6.079e+02 9.287e+02 2.149e+03, threshold=1.216e+03, percent-clipped=7.0 2023-06-26 13:41:31,317 INFO [train.py:996] (3/4) Epoch 9, batch 19550, loss[loss=0.2034, simple_loss=0.2918, pruned_loss=0.05754, over 21531.00 frames. ], tot_loss[loss=0.2049, simple_loss=0.2809, pruned_loss=0.06447, over 4277743.37 frames. ], batch size: 471, lr: 3.26e-03, grad_scale: 16.0 2023-06-26 13:41:35,606 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1581042.0, ans=0.125 2023-06-26 13:42:19,368 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=7.99 vs. 
limit=15.0 2023-06-26 13:42:30,255 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1581162.0, ans=0.125 2023-06-26 13:42:47,818 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1581222.0, ans=0.125 2023-06-26 13:43:18,850 INFO [train.py:996] (3/4) Epoch 9, batch 19600, loss[loss=0.2363, simple_loss=0.3125, pruned_loss=0.08008, over 21836.00 frames. ], tot_loss[loss=0.2063, simple_loss=0.2826, pruned_loss=0.06499, over 4279666.58 frames. ], batch size: 112, lr: 3.26e-03, grad_scale: 32.0 2023-06-26 13:43:29,296 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.268e+02 5.045e+02 6.281e+02 9.154e+02 2.396e+03, threshold=1.256e+03, percent-clipped=14.0 2023-06-26 13:43:51,239 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1581402.0, ans=0.125 2023-06-26 13:43:56,691 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=1581402.0, ans=0.2 2023-06-26 13:44:05,275 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1581402.0, ans=0.1 2023-06-26 13:44:25,995 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=1581522.0, ans=0.07 2023-06-26 13:44:43,488 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=1581522.0, ans=0.05 2023-06-26 13:44:45,374 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1581522.0, ans=0.125 2023-06-26 13:44:58,037 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=10.76 vs. limit=15.0 2023-06-26 13:44:59,260 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1581582.0, ans=0.125 2023-06-26 13:45:13,602 INFO [train.py:996] (3/4) Epoch 9, batch 19650, loss[loss=0.2242, simple_loss=0.2968, pruned_loss=0.07573, over 21630.00 frames. ], tot_loss[loss=0.2116, simple_loss=0.2875, pruned_loss=0.0679, over 4287550.59 frames. 
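The frequent `ScheduledFloat: name=..., batch_count=..., ans=...` entries report the current value (`ans`) of hyper-parameters that are scheduled as a function of the global batch count: skip rates, dropout probabilities, bypass scales, balancer probabilities and so on. A minimal sketch of such a batch-count-keyed schedule follows; the class name and the piecewise-linear interpolation are assumptions for illustration, not the actual scaling.py implementation.

```python
# Sketch of a batch-count-dependent scheduled hyper-parameter, assuming
# piecewise-linear interpolation between (batch_count, value) breakpoints.
from bisect import bisect_right
from typing import List, Tuple

class ScheduledValue:
    def __init__(self, points: List[Tuple[float, float]]):
        # points: (batch_count, value) breakpoints defining the schedule
        self.points = sorted(points)

    def __call__(self, batch_count: float) -> float:
        xs = [p[0] for p in self.points]
        i = bisect_right(xs, batch_count)
        if i == 0:
            return self.points[0][1]        # before the first breakpoint
        if i == len(self.points):
            return self.points[-1][1]       # after the last breakpoint
        (x0, y0), (x1, y1) = self.points[i - 1], self.points[i]
        t = (batch_count - x0) / (x1 - x0)  # linear interpolation
        return y0 + t * (y1 - y0)

# e.g. a skip rate decaying from 0.5 to 0.0 over the first 20k batches;
# long past the schedule (as at batch_count=1581402 above) it stays at 0.0
skip_rate = ScheduledValue([(0.0, 0.5), (20000.0, 0.0)])
print(skip_rate(1581402.0))
```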
], batch size: 230, lr: 3.26e-03, grad_scale: 16.0 2023-06-26 13:45:46,261 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1581702.0, ans=0.125 2023-06-26 13:45:56,006 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2.whitening_limit, batch_count=1581702.0, ans=15.0 2023-06-26 13:46:00,281 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1581762.0, ans=0.0 2023-06-26 13:46:05,733 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1581762.0, ans=0.125 2023-06-26 13:46:12,864 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=1581762.0, ans=0.07 2023-06-26 13:46:42,656 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1581882.0, ans=0.125 2023-06-26 13:46:48,535 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=4.49 vs. limit=15.0 2023-06-26 13:46:49,361 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1581882.0, ans=0.125 2023-06-26 13:47:15,804 INFO [train.py:996] (3/4) Epoch 9, batch 19700, loss[loss=0.2077, simple_loss=0.3065, pruned_loss=0.05444, over 21677.00 frames. ], tot_loss[loss=0.2148, simple_loss=0.2912, pruned_loss=0.06916, over 4289067.19 frames. ], batch size: 351, lr: 3.26e-03, grad_scale: 16.0 2023-06-26 13:47:22,842 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.461e+02 6.188e+02 8.447e+02 1.401e+03 2.428e+03, threshold=1.689e+03, percent-clipped=28.0 2023-06-26 13:47:52,599 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=1582002.0, ans=0.0 2023-06-26 13:49:00,471 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1582182.0, ans=0.1 2023-06-26 13:49:05,403 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1582242.0, ans=0.0 2023-06-26 13:49:06,315 INFO [train.py:996] (3/4) Epoch 9, batch 19750, loss[loss=0.2416, simple_loss=0.3484, pruned_loss=0.06747, over 21775.00 frames. ], tot_loss[loss=0.2222, simple_loss=0.3022, pruned_loss=0.07115, over 4286627.28 frames. ], batch size: 282, lr: 3.26e-03, grad_scale: 16.0 2023-06-26 13:49:16,157 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1582242.0, ans=0.125 2023-06-26 13:49:31,711 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.12 vs. limit=15.0 2023-06-26 13:50:38,496 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1582482.0, ans=0.1 2023-06-26 13:50:55,595 INFO [train.py:996] (3/4) Epoch 9, batch 19800, loss[loss=0.191, simple_loss=0.2603, pruned_loss=0.06083, over 21240.00 frames. ], tot_loss[loss=0.2206, simple_loss=0.2997, pruned_loss=0.07069, over 4290258.49 frames. 
], batch size: 176, lr: 3.26e-03, grad_scale: 16.0 2023-06-26 13:51:02,876 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.312e+02 6.209e+02 8.156e+02 1.271e+03 2.290e+03, threshold=1.631e+03, percent-clipped=8.0 2023-06-26 13:51:05,065 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1582542.0, ans=0.0 2023-06-26 13:51:33,601 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=1582602.0, ans=0.5 2023-06-26 13:52:11,137 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-26 13:52:40,911 INFO [train.py:996] (3/4) Epoch 9, batch 19850, loss[loss=0.1765, simple_loss=0.2562, pruned_loss=0.04841, over 21345.00 frames. ], tot_loss[loss=0.2131, simple_loss=0.2935, pruned_loss=0.06634, over 4286941.16 frames. ], batch size: 176, lr: 3.26e-03, grad_scale: 16.0 2023-06-26 13:52:52,219 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=1582842.0, ans=0.2 2023-06-26 13:54:07,845 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1583022.0, ans=0.125 2023-06-26 13:54:11,793 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1583082.0, ans=0.0 2023-06-26 13:54:16,344 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=1583082.0, ans=0.125 2023-06-26 13:54:26,688 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=1583142.0, ans=0.125 2023-06-26 13:54:27,762 INFO [train.py:996] (3/4) Epoch 9, batch 19900, loss[loss=0.1874, simple_loss=0.2603, pruned_loss=0.05726, over 21596.00 frames. ], tot_loss[loss=0.2101, simple_loss=0.2924, pruned_loss=0.06393, over 4287797.62 frames. ], batch size: 263, lr: 3.26e-03, grad_scale: 16.0 2023-06-26 13:54:29,179 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=10.62 vs. limit=15.0 2023-06-26 13:54:34,750 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.163e+02 4.779e+02 6.020e+02 7.987e+02 2.016e+03, threshold=1.204e+03, percent-clipped=5.0 2023-06-26 13:54:58,703 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=1583202.0, ans=0.125 2023-06-26 13:55:28,638 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=1583262.0, ans=0.125 2023-06-26 13:56:18,951 INFO [train.py:996] (3/4) Epoch 9, batch 19950, loss[loss=0.1831, simple_loss=0.2662, pruned_loss=0.05005, over 21730.00 frames. ], tot_loss[loss=0.2069, simple_loss=0.2867, pruned_loss=0.06349, over 4281412.45 frames. 
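Each `Clipping_scale=2.0, grad-norm quartiles ... threshold=..., percent-clipped=...` entry summarizes recent gradient norms (min / 25% / median / 75% / max) together with the clipping threshold in effect; in this log the threshold equals clipping_scale times the reported median, e.g. 1.631e+03 = 2.0 × 8.156e+02 in the entry above. The following is a hedged sketch of that kind of median-based clipping, assuming a simple rolling window of norms; it is not the optimizer code that wrote these lines.

```python
# Sketch, under assumptions: keep a rolling window of recent gradient norms,
# clip when the current norm exceeds clipping_scale * median, and report the
# fraction of batches that were clipped.  Illustrative only.
from collections import deque
import torch

class MedianGradClipper:
    def __init__(self, clipping_scale: float = 2.0, window: int = 200):
        self.clipping_scale = clipping_scale
        self.norms = deque(maxlen=window)
        self.clipped = 0
        self.seen = 0

    def __call__(self, parameters) -> float:
        params = [p for p in parameters if p.grad is not None]
        if not params:
            return 0.0
        norm = torch.norm(torch.stack([p.grad.norm() for p in params])).item()
        self.norms.append(norm)
        self.seen += 1
        threshold = self.clipping_scale * torch.tensor(list(self.norms)).median().item()
        if norm > threshold:
            self.clipped += 1
            for p in params:
                p.grad.mul_(threshold / norm)  # scale gradients down to the threshold
        return 100.0 * self.clipped / self.seen  # the "percent-clipped" figure so far

# usage sketch: clipper = MedianGradClipper(); loss.backward(); pct = clipper(model.parameters())
```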
], batch size: 316, lr: 3.26e-03, grad_scale: 16.0 2023-06-26 13:56:51,445 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1583502.0, ans=0.1 2023-06-26 13:57:07,254 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1583502.0, ans=0.1 2023-06-26 13:57:07,345 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1583502.0, ans=0.125 2023-06-26 13:57:21,084 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=1583562.0, ans=0.2 2023-06-26 13:57:33,289 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1583622.0, ans=0.125 2023-06-26 13:58:06,795 INFO [train.py:996] (3/4) Epoch 9, batch 20000, loss[loss=0.2109, simple_loss=0.2839, pruned_loss=0.06891, over 21885.00 frames. ], tot_loss[loss=0.2083, simple_loss=0.2875, pruned_loss=0.06459, over 4275819.58 frames. ], batch size: 124, lr: 3.26e-03, grad_scale: 32.0 2023-06-26 13:58:09,215 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=1583742.0, ans=0.0 2023-06-26 13:58:19,150 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.477e+02 4.524e+02 6.104e+02 8.785e+02 2.084e+03, threshold=1.221e+03, percent-clipped=7.0 2023-06-26 13:58:37,272 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1583802.0, ans=0.0 2023-06-26 13:59:24,196 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1583922.0, ans=0.125 2023-06-26 13:59:30,930 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1583922.0, ans=0.125 2023-06-26 13:59:56,249 INFO [train.py:996] (3/4) Epoch 9, batch 20050, loss[loss=0.201, simple_loss=0.2842, pruned_loss=0.05886, over 21810.00 frames. ], tot_loss[loss=0.2116, simple_loss=0.2899, pruned_loss=0.06668, over 4287562.86 frames. ], batch size: 282, lr: 3.26e-03, grad_scale: 32.0 2023-06-26 14:00:08,735 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1584042.0, ans=0.0 2023-06-26 14:00:16,796 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.11 vs. limit=15.0 2023-06-26 14:01:51,567 INFO [train.py:996] (3/4) Epoch 9, batch 20100, loss[loss=0.2463, simple_loss=0.34, pruned_loss=0.07628, over 21813.00 frames. ], tot_loss[loss=0.2146, simple_loss=0.2918, pruned_loss=0.06874, over 4282856.97 frames. ], batch size: 351, lr: 3.26e-03, grad_scale: 16.0 2023-06-26 14:02:00,511 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.881e+02 4.985e+02 7.812e+02 1.091e+03 2.146e+03, threshold=1.562e+03, percent-clipped=15.0 2023-06-26 14:02:05,502 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1584342.0, ans=0.125 2023-06-26 14:03:13,692 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=11.10 vs. 
limit=15.0 2023-06-26 14:03:43,011 INFO [train.py:996] (3/4) Epoch 9, batch 20150, loss[loss=0.2677, simple_loss=0.3394, pruned_loss=0.09801, over 21696.00 frames. ], tot_loss[loss=0.2214, simple_loss=0.3, pruned_loss=0.07142, over 4283956.41 frames. ], batch size: 351, lr: 3.25e-03, grad_scale: 16.0 2023-06-26 14:05:06,300 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1584822.0, ans=0.125 2023-06-26 14:05:53,457 INFO [train.py:996] (3/4) Epoch 9, batch 20200, loss[loss=0.2612, simple_loss=0.3635, pruned_loss=0.07945, over 21673.00 frames. ], tot_loss[loss=0.2279, simple_loss=0.3073, pruned_loss=0.07422, over 4282780.98 frames. ], batch size: 389, lr: 3.25e-03, grad_scale: 16.0 2023-06-26 14:06:02,514 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.033e+02 6.140e+02 1.031e+03 1.445e+03 3.124e+03, threshold=2.061e+03, percent-clipped=23.0 2023-06-26 14:06:58,258 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1585122.0, ans=0.125 2023-06-26 14:07:08,466 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1585182.0, ans=0.125 2023-06-26 14:07:41,698 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=7.67 vs. limit=15.0 2023-06-26 14:07:43,936 INFO [train.py:996] (3/4) Epoch 9, batch 20250, loss[loss=0.2037, simple_loss=0.2963, pruned_loss=0.05548, over 21827.00 frames. ], tot_loss[loss=0.2262, simple_loss=0.307, pruned_loss=0.07273, over 4276566.70 frames. ], batch size: 282, lr: 3.25e-03, grad_scale: 16.0 2023-06-26 14:07:53,129 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=1585242.0, ans=0.125 2023-06-26 14:07:56,480 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1585242.0, ans=0.1 2023-06-26 14:08:07,939 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.06 vs. limit=12.0 2023-06-26 14:08:44,176 INFO [scaling.py:962] (3/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=4.27 vs. limit=5.0 2023-06-26 14:09:05,291 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1585482.0, ans=0.0 2023-06-26 14:09:26,862 INFO [train.py:996] (3/4) Epoch 9, batch 20300, loss[loss=0.1837, simple_loss=0.2598, pruned_loss=0.0538, over 21897.00 frames. ], tot_loss[loss=0.2221, simple_loss=0.3036, pruned_loss=0.07033, over 4259485.31 frames. ], batch size: 98, lr: 3.25e-03, grad_scale: 16.0 2023-06-26 14:09:35,556 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.339e+02 4.853e+02 6.521e+02 1.002e+03 2.689e+03, threshold=1.304e+03, percent-clipped=1.0 2023-06-26 14:10:04,209 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.96 vs. 
limit=6.0 2023-06-26 14:10:17,249 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=1585662.0, ans=0.125 2023-06-26 14:10:21,484 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.08 vs. limit=10.0 2023-06-26 14:10:35,036 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1585722.0, ans=0.125 2023-06-26 14:11:13,068 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1585782.0, ans=0.125 2023-06-26 14:11:15,936 INFO [train.py:996] (3/4) Epoch 9, batch 20350, loss[loss=0.2485, simple_loss=0.3314, pruned_loss=0.08278, over 21363.00 frames. ], tot_loss[loss=0.2223, simple_loss=0.3036, pruned_loss=0.07047, over 4266103.02 frames. ], batch size: 131, lr: 3.25e-03, grad_scale: 16.0 2023-06-26 14:11:27,525 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1585842.0, ans=0.0 2023-06-26 14:11:27,560 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1585842.0, ans=0.125 2023-06-26 14:11:32,832 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1585902.0, ans=0.125 2023-06-26 14:11:41,510 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1585902.0, ans=0.0 2023-06-26 14:12:01,726 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=1585962.0, ans=10.0 2023-06-26 14:12:05,883 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=5.21 vs. limit=15.0 2023-06-26 14:12:42,643 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1586082.0, ans=0.0 2023-06-26 14:13:04,287 INFO [train.py:996] (3/4) Epoch 9, batch 20400, loss[loss=0.2339, simple_loss=0.3155, pruned_loss=0.07611, over 21937.00 frames. ], tot_loss[loss=0.2259, simple_loss=0.3065, pruned_loss=0.07269, over 4262484.02 frames. ], batch size: 316, lr: 3.25e-03, grad_scale: 32.0 2023-06-26 14:13:13,312 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.291e+02 5.756e+02 8.261e+02 1.227e+03 2.104e+03, threshold=1.652e+03, percent-clipped=22.0 2023-06-26 14:13:13,846 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1586142.0, ans=0.125 2023-06-26 14:13:22,934 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1586202.0, ans=0.125 2023-06-26 14:13:29,995 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.07 vs. limit=6.0 2023-06-26 14:13:59,037 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=1586262.0, ans=0.0 2023-06-26 14:14:52,319 INFO [train.py:996] (3/4) Epoch 9, batch 20450, loss[loss=0.249, simple_loss=0.3046, pruned_loss=0.09668, over 21580.00 frames. 
], tot_loss[loss=0.2287, simple_loss=0.3074, pruned_loss=0.07502, over 4265958.33 frames. ], batch size: 507, lr: 3.25e-03, grad_scale: 16.0 2023-06-26 14:14:57,690 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-26 14:15:39,370 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1586562.0, ans=0.0 2023-06-26 14:16:33,720 INFO [train.py:996] (3/4) Epoch 9, batch 20500, loss[loss=0.1941, simple_loss=0.2652, pruned_loss=0.06153, over 21681.00 frames. ], tot_loss[loss=0.2262, simple_loss=0.3033, pruned_loss=0.07459, over 4261611.90 frames. ], batch size: 282, lr: 3.25e-03, grad_scale: 16.0 2023-06-26 14:16:44,013 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.954e+02 5.491e+02 7.367e+02 1.069e+03 2.836e+03, threshold=1.473e+03, percent-clipped=8.0 2023-06-26 14:17:04,258 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.76 vs. limit=15.0 2023-06-26 14:17:13,015 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=5.39 vs. limit=15.0 2023-06-26 14:17:38,218 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1586922.0, ans=0.0 2023-06-26 14:18:18,870 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=1586982.0, ans=0.2 2023-06-26 14:18:21,690 INFO [train.py:996] (3/4) Epoch 9, batch 20550, loss[loss=0.2048, simple_loss=0.2791, pruned_loss=0.06524, over 21263.00 frames. ], tot_loss[loss=0.2209, simple_loss=0.2958, pruned_loss=0.07296, over 4255959.95 frames. ], batch size: 176, lr: 3.25e-03, grad_scale: 16.0 2023-06-26 14:18:27,282 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1587042.0, ans=0.0 2023-06-26 14:19:14,625 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=1587162.0, ans=0.125 2023-06-26 14:20:09,348 INFO [train.py:996] (3/4) Epoch 9, batch 20600, loss[loss=0.2053, simple_loss=0.3043, pruned_loss=0.05317, over 20713.00 frames. ], tot_loss[loss=0.2211, simple_loss=0.2988, pruned_loss=0.07174, over 4250790.52 frames. 
], batch size: 607, lr: 3.25e-03, grad_scale: 16.0 2023-06-26 14:20:14,056 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1587342.0, ans=0.0 2023-06-26 14:20:19,842 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.630e+02 4.988e+02 6.640e+02 9.393e+02 1.385e+03, threshold=1.328e+03, percent-clipped=0.0 2023-06-26 14:20:30,754 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=1587402.0, ans=0.0 2023-06-26 14:20:42,537 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1587402.0, ans=0.0 2023-06-26 14:21:00,278 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=1587462.0, ans=0.0 2023-06-26 14:21:26,184 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1587522.0, ans=0.125 2023-06-26 14:21:56,971 INFO [train.py:996] (3/4) Epoch 9, batch 20650, loss[loss=0.1996, simple_loss=0.2663, pruned_loss=0.06645, over 21266.00 frames. ], tot_loss[loss=0.219, simple_loss=0.2945, pruned_loss=0.07178, over 4254750.81 frames. ], batch size: 548, lr: 3.25e-03, grad_scale: 16.0 2023-06-26 14:22:01,582 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.57 vs. limit=15.0 2023-06-26 14:23:01,918 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1587822.0, ans=0.125 2023-06-26 14:23:45,541 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=4.30 vs. limit=15.0 2023-06-26 14:23:47,397 INFO [train.py:996] (3/4) Epoch 9, batch 20700, loss[loss=0.1972, simple_loss=0.2807, pruned_loss=0.0569, over 21650.00 frames. ], tot_loss[loss=0.2129, simple_loss=0.2878, pruned_loss=0.06902, over 4258749.92 frames. ], batch size: 414, lr: 3.25e-03, grad_scale: 16.0 2023-06-26 14:23:58,487 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.316e+02 4.993e+02 7.910e+02 1.068e+03 1.993e+03, threshold=1.582e+03, percent-clipped=12.0 2023-06-26 14:24:01,030 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=1587942.0, ans=0.2 2023-06-26 14:24:21,710 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-26 14:25:23,223 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1588182.0, ans=0.125 2023-06-26 14:25:26,648 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=1588182.0, ans=0.0 2023-06-26 14:25:33,175 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=23.05 vs. limit=15.0 2023-06-26 14:25:38,410 INFO [train.py:996] (3/4) Epoch 9, batch 20750, loss[loss=0.3347, simple_loss=0.4166, pruned_loss=0.1264, over 21425.00 frames. ], tot_loss[loss=0.2156, simple_loss=0.2927, pruned_loss=0.06925, over 4260053.53 frames. 
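The `grad_scale` value attached to each batch summary (switching between 16.0 and 32.0 in this stretch of the log) is the dynamic loss-scaling factor used for mixed-precision training: it is reduced when overflows are detected and periodically grown back. A minimal sketch using PyTorch's standard GradScaler is below, as an illustration of the mechanism rather than the actual training loop; the model, optimizer and data are placeholders.

```python
# Minimal mixed-precision step with dynamic loss scaling, showing where a
# "grad_scale: 16.0 / 32.0" value comes from.  Requires a CUDA device, as in
# this run; everything here is illustrative, not the script behind this log.
import torch

model = torch.nn.Linear(80, 4).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=3.26e-3)
scaler = torch.cuda.amp.GradScaler(init_scale=32.0)

features = torch.randn(8, 80, device="cuda")
targets = torch.randn(8, 4, device="cuda")

with torch.cuda.amp.autocast():
    loss = torch.nn.functional.mse_loss(model(features), targets)

scaler.scale(loss).backward()  # backward on the scaled loss
scaler.step(optimizer)         # unscales first; skips the step on inf/nan gradients
scaler.update()                # shrinks or grows the scale factor
print(scaler.get_scale())      # the number a log line would report as grad_scale
```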
], batch size: 507, lr: 3.25e-03, grad_scale: 16.0 2023-06-26 14:25:47,357 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=1588242.0, ans=0.05 2023-06-26 14:26:35,299 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=1588362.0, ans=0.2 2023-06-26 14:26:39,333 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=1588362.0, ans=0.125 2023-06-26 14:26:47,262 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1588422.0, ans=0.125 2023-06-26 14:27:17,990 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=1588482.0, ans=0.125 2023-06-26 14:27:32,210 INFO [train.py:996] (3/4) Epoch 9, batch 20800, loss[loss=0.1928, simple_loss=0.2685, pruned_loss=0.05856, over 21564.00 frames. ], tot_loss[loss=0.2149, simple_loss=0.2917, pruned_loss=0.06901, over 4259400.41 frames. ], batch size: 263, lr: 3.25e-03, grad_scale: 32.0 2023-06-26 14:27:41,564 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=1588542.0, ans=0.05 2023-06-26 14:27:42,703 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.373e+02 6.315e+02 8.167e+02 1.529e+03 3.332e+03, threshold=1.633e+03, percent-clipped=23.0 2023-06-26 14:28:14,879 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1588662.0, ans=0.125 2023-06-26 14:29:19,945 INFO [train.py:996] (3/4) Epoch 9, batch 20850, loss[loss=0.2022, simple_loss=0.2645, pruned_loss=0.06999, over 21453.00 frames. ], tot_loss[loss=0.2084, simple_loss=0.2836, pruned_loss=0.06659, over 4258608.28 frames. ], batch size: 548, lr: 3.25e-03, grad_scale: 16.0 2023-06-26 14:29:40,774 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1588902.0, ans=0.1 2023-06-26 14:30:13,813 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.19 vs. limit=15.0 2023-06-26 14:30:22,593 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1589022.0, ans=0.125 2023-06-26 14:30:35,504 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1589022.0, ans=0.125 2023-06-26 14:30:39,163 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.63 vs. limit=15.0 2023-06-26 14:30:49,413 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=1589082.0, ans=0.5 2023-06-26 14:30:52,376 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1589082.0, ans=0.125 2023-06-26 14:31:08,540 INFO [train.py:996] (3/4) Epoch 9, batch 20900, loss[loss=0.1932, simple_loss=0.2717, pruned_loss=0.05739, over 21538.00 frames. ], tot_loss[loss=0.2106, simple_loss=0.2854, pruned_loss=0.06783, over 4256785.28 frames. 
], batch size: 195, lr: 3.25e-03, grad_scale: 16.0 2023-06-26 14:31:13,857 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=1589142.0, ans=0.125 2023-06-26 14:31:20,484 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.215e+02 4.594e+02 6.029e+02 1.010e+03 2.105e+03, threshold=1.206e+03, percent-clipped=4.0 2023-06-26 14:31:22,629 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1589142.0, ans=0.125 2023-06-26 14:31:29,719 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.09 vs. limit=15.0 2023-06-26 14:31:30,623 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1589202.0, ans=0.125 2023-06-26 14:32:19,795 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=1589322.0, ans=0.2 2023-06-26 14:32:29,301 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1589322.0, ans=0.0 2023-06-26 14:32:30,842 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1589382.0, ans=0.1 2023-06-26 14:32:48,521 INFO [train.py:996] (3/4) Epoch 9, batch 20950, loss[loss=0.2554, simple_loss=0.3833, pruned_loss=0.06369, over 19701.00 frames. ], tot_loss[loss=0.2063, simple_loss=0.2827, pruned_loss=0.06496, over 4254251.92 frames. ], batch size: 702, lr: 3.25e-03, grad_scale: 16.0 2023-06-26 14:32:57,913 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1589442.0, ans=0.125 2023-06-26 14:34:07,963 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1589622.0, ans=0.1 2023-06-26 14:34:19,834 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=1589682.0, ans=10.0 2023-06-26 14:34:36,211 INFO [train.py:996] (3/4) Epoch 9, batch 21000, loss[loss=0.2306, simple_loss=0.2971, pruned_loss=0.08201, over 21640.00 frames. ], tot_loss[loss=0.2083, simple_loss=0.2849, pruned_loss=0.06583, over 4247096.23 frames. ], batch size: 471, lr: 3.25e-03, grad_scale: 16.0 2023-06-26 14:34:36,212 INFO [train.py:1019] (3/4) Computing validation loss 2023-06-26 14:34:52,545 INFO [zipformer.py:1728] (3/4) name=encoder.encoders.3.encoder.layers.0.self_attn_weights, attn_weights_entropy = tensor([3.0251, 2.4602, 2.3895, 3.0419, 1.6836, 2.7777, 2.8143, 2.0973], device='cuda:3') 2023-06-26 14:34:59,720 INFO [train.py:1028] (3/4) Epoch 9, validation: loss=0.2612, simple_loss=0.3587, pruned_loss=0.0819, over 1796401.00 frames. 2023-06-26 14:34:59,721 INFO [train.py:1029] (3/4) Maximum memory allocated so far is 23690MB 2023-06-26 14:35:08,242 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=6.59 vs. 
limit=15.0 2023-06-26 14:35:09,031 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1589742.0, ans=0.125 2023-06-26 14:35:11,952 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.378e+02 4.892e+02 7.035e+02 1.069e+03 1.759e+03, threshold=1.407e+03, percent-clipped=17.0 2023-06-26 14:35:50,582 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.74 vs. limit=6.0 2023-06-26 14:36:02,738 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1589922.0, ans=0.125 2023-06-26 14:36:15,136 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1589922.0, ans=0.0 2023-06-26 14:36:49,884 INFO [train.py:996] (3/4) Epoch 9, batch 21050, loss[loss=0.1841, simple_loss=0.276, pruned_loss=0.04606, over 19882.00 frames. ], tot_loss[loss=0.2081, simple_loss=0.2835, pruned_loss=0.06629, over 4254090.98 frames. ], batch size: 703, lr: 3.25e-03, grad_scale: 16.0 2023-06-26 14:37:11,167 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-26 14:37:11,758 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.86 vs. limit=15.0 2023-06-26 14:38:36,803 INFO [train.py:996] (3/4) Epoch 9, batch 21100, loss[loss=0.1875, simple_loss=0.253, pruned_loss=0.061, over 21605.00 frames. ], tot_loss[loss=0.205, simple_loss=0.2792, pruned_loss=0.06538, over 4255582.65 frames. ], batch size: 247, lr: 3.25e-03, grad_scale: 8.0 2023-06-26 14:38:39,239 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=1590342.0, ans=0.0 2023-06-26 14:38:50,943 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.682e+02 5.080e+02 7.538e+02 1.007e+03 2.026e+03, threshold=1.508e+03, percent-clipped=9.0 2023-06-26 14:39:27,223 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=1590462.0, ans=0.125 2023-06-26 14:39:38,786 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1590522.0, ans=0.1 2023-06-26 14:40:25,042 INFO [train.py:996] (3/4) Epoch 9, batch 21150, loss[loss=0.1953, simple_loss=0.2566, pruned_loss=0.06697, over 21240.00 frames. ], tot_loss[loss=0.2034, simple_loss=0.2753, pruned_loss=0.06574, over 4257739.28 frames. ], batch size: 144, lr: 3.25e-03, grad_scale: 8.0 2023-06-26 14:40:42,408 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1590702.0, ans=0.125 2023-06-26 14:40:42,991 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=5.51 vs. 
limit=15.0 2023-06-26 14:41:10,156 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=1590762.0, ans=0.0 2023-06-26 14:41:11,912 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer_ff2.min_abs, batch_count=1590762.0, ans=0.1 2023-06-26 14:41:42,130 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1590822.0, ans=0.125 2023-06-26 14:41:45,030 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.92 vs. limit=10.0 2023-06-26 14:41:51,460 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.42 vs. limit=15.0 2023-06-26 14:42:12,171 INFO [train.py:996] (3/4) Epoch 9, batch 21200, loss[loss=0.1727, simple_loss=0.2494, pruned_loss=0.04797, over 21707.00 frames. ], tot_loss[loss=0.1999, simple_loss=0.2709, pruned_loss=0.06448, over 4255662.15 frames. ], batch size: 298, lr: 3.25e-03, grad_scale: 16.0 2023-06-26 14:42:17,347 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=7.34 vs. limit=12.0 2023-06-26 14:42:25,188 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=1590942.0, ans=0.125 2023-06-26 14:42:26,036 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.202e+02 4.962e+02 6.952e+02 8.758e+02 1.783e+03, threshold=1.390e+03, percent-clipped=2.0 2023-06-26 14:42:32,127 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1591002.0, ans=0.0 2023-06-26 14:43:29,676 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.03 vs. limit=12.0 2023-06-26 14:43:56,767 INFO [train.py:996] (3/4) Epoch 9, batch 21250, loss[loss=0.2241, simple_loss=0.3062, pruned_loss=0.07099, over 21599.00 frames. ], tot_loss[loss=0.1985, simple_loss=0.2692, pruned_loss=0.06392, over 4255556.01 frames. ], batch size: 263, lr: 3.25e-03, grad_scale: 16.0 2023-06-26 14:44:01,337 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.24 vs. limit=22.5 2023-06-26 14:44:23,378 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1591302.0, ans=0.0 2023-06-26 14:45:33,952 INFO [train.py:996] (3/4) Epoch 9, batch 21300, loss[loss=0.2167, simple_loss=0.2848, pruned_loss=0.07432, over 21607.00 frames. ], tot_loss[loss=0.205, simple_loss=0.2767, pruned_loss=0.06664, over 4255919.07 frames. 
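The `Whitening: name=..., metric=X vs. limit=Y` entries compare a per-module measure of how non-white (anisotropic) a module's activations are against a limit beyond which a corrective term would be applied. The exact metric computed in scaling.py is not reproduced here; the sketch below shows one standard way to quantify covariance anisotropy (1.0 for perfectly white features, larger as variance concentrates in a few directions), purely to give the logged numbers an interpretation.

```python
# Assumption-labelled stand-in for a "whitening" metric:
#   metric = d * trace(C @ C) / trace(C)**2
# where C is the feature covariance and d the channel count.  This equals 1.0
# when C is a multiple of the identity and grows with anisotropy; it is not
# claimed to be the metric scaling.py reports.
import torch

def whitening_metric(x: torch.Tensor) -> float:
    # x: (num_frames, num_channels)
    x = x - x.mean(dim=0, keepdim=True)
    cov = (x.T @ x) / x.shape[0]
    d = x.shape[1]
    return (d * torch.trace(cov @ cov) / torch.trace(cov) ** 2).item()

white = torch.randn(10000, 256)                  # roughly isotropic features
skewed = white * torch.linspace(0.1, 3.0, 256)   # variance concentrated in some channels
print(whitening_metric(white))    # close to 1
print(whitening_metric(skewed))   # noticeably larger, like the metric=... values above
```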
], batch size: 548, lr: 3.25e-03, grad_scale: 16.0 2023-06-26 14:45:47,858 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=1591542.0, ans=0.0 2023-06-26 14:45:52,769 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.243e+02 5.570e+02 8.003e+02 1.129e+03 3.066e+03, threshold=1.601e+03, percent-clipped=15.0 2023-06-26 14:45:54,998 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1591602.0, ans=0.0 2023-06-26 14:46:27,009 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=1591662.0, ans=0.125 2023-06-26 14:47:23,323 INFO [train.py:996] (3/4) Epoch 9, batch 21350, loss[loss=0.2034, simple_loss=0.2971, pruned_loss=0.05483, over 21794.00 frames. ], tot_loss[loss=0.2065, simple_loss=0.2799, pruned_loss=0.06657, over 4255707.07 frames. ], batch size: 298, lr: 3.25e-03, grad_scale: 16.0 2023-06-26 14:47:32,683 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=1591842.0, ans=0.125 2023-06-26 14:48:15,465 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.51 vs. limit=6.0 2023-06-26 14:48:23,364 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.min_positive, batch_count=1591962.0, ans=0.025 2023-06-26 14:48:59,085 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1592082.0, ans=0.125 2023-06-26 14:49:00,615 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1592082.0, ans=0.125 2023-06-26 14:49:12,013 INFO [train.py:996] (3/4) Epoch 9, batch 21400, loss[loss=0.2056, simple_loss=0.2761, pruned_loss=0.0676, over 21718.00 frames. ], tot_loss[loss=0.2077, simple_loss=0.2831, pruned_loss=0.06619, over 4266954.84 frames. ], batch size: 112, lr: 3.25e-03, grad_scale: 16.0 2023-06-26 14:49:18,573 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=1592142.0, ans=0.2 2023-06-26 14:49:26,001 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.278e+02 4.706e+02 6.583e+02 9.880e+02 2.077e+03, threshold=1.317e+03, percent-clipped=4.0 2023-06-26 14:49:47,371 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=1592202.0, ans=0.0 2023-06-26 14:50:41,269 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1592382.0, ans=0.1 2023-06-26 14:50:54,742 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=15.13 vs. limit=22.5 2023-06-26 14:50:57,468 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=1592382.0, ans=0.125 2023-06-26 14:51:00,486 INFO [train.py:996] (3/4) Epoch 9, batch 21450, loss[loss=0.2182, simple_loss=0.2861, pruned_loss=0.07514, over 21500.00 frames. ], tot_loss[loss=0.2112, simple_loss=0.2866, pruned_loss=0.06786, over 4266657.89 frames. 
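At batch 21000 above the trainer switches to the dev set (`Computing validation loss`), reports a frame-weighted validation loss (0.2612, again consistent with 0.5 · simple_loss + pruned_loss), and logs the peak CUDA memory seen so far (23690MB). A compact sketch of such a validation pass is given below; the function name, loop structure and `compute_loss` helper are assumptions, while `torch.no_grad` and `torch.cuda.max_memory_allocated` are the standard PyTorch calls involved.

```python
# Sketch of a validation pass producing a frame-weighted loss and a peak-memory
# report like the entries above.  `compute_loss` and `dev_loader` are
# placeholders for this illustration, not the real icefall objects.
import torch

def validate(model, dev_loader, compute_loss, device="cuda:3"):
    model.eval()
    tot_loss, tot_frames = 0.0, 0.0
    with torch.no_grad():  # no gradients during validation
        for batch in dev_loader:
            loss, num_frames = compute_loss(model, batch)
            tot_loss += loss.item() * num_frames
            tot_frames += num_frames
    model.train()
    max_mb = torch.cuda.max_memory_allocated(device) // (1024 * 1024)
    print(f"validation: loss={tot_loss / tot_frames:.4f}, over {tot_frames:.2f} frames. "
          f"Maximum memory allocated so far is {max_mb}MB")
```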
], batch size: 194, lr: 3.25e-03, grad_scale: 16.0 2023-06-26 14:52:18,022 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1592622.0, ans=0.1 2023-06-26 14:52:43,808 INFO [train.py:996] (3/4) Epoch 9, batch 21500, loss[loss=0.1784, simple_loss=0.2457, pruned_loss=0.05554, over 21541.00 frames. ], tot_loss[loss=0.2113, simple_loss=0.2854, pruned_loss=0.06861, over 4253107.46 frames. ], batch size: 263, lr: 3.25e-03, grad_scale: 16.0 2023-06-26 14:52:55,751 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.26 vs. limit=15.0 2023-06-26 14:53:03,330 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.822e+02 5.893e+02 8.169e+02 1.189e+03 2.218e+03, threshold=1.634e+03, percent-clipped=19.0 2023-06-26 14:53:05,492 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=1592802.0, ans=0.125 2023-06-26 14:53:17,983 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1592802.0, ans=0.125 2023-06-26 14:53:42,480 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1592862.0, ans=0.125 2023-06-26 14:54:06,093 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1592922.0, ans=0.125 2023-06-26 14:54:32,477 INFO [train.py:996] (3/4) Epoch 9, batch 21550, loss[loss=0.19, simple_loss=0.2627, pruned_loss=0.05869, over 21835.00 frames. ], tot_loss[loss=0.2058, simple_loss=0.2788, pruned_loss=0.06641, over 4258608.68 frames. ], batch size: 98, lr: 3.25e-03, grad_scale: 8.0 2023-06-26 14:55:17,844 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1593162.0, ans=0.125 2023-06-26 14:55:17,868 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1593162.0, ans=0.125 2023-06-26 14:55:29,664 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1593162.0, ans=0.125 2023-06-26 14:55:54,567 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=1593222.0, ans=0.2 2023-06-26 14:56:14,363 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=1593282.0, ans=0.2 2023-06-26 14:56:26,270 INFO [train.py:996] (3/4) Epoch 9, batch 21600, loss[loss=0.1544, simple_loss=0.2262, pruned_loss=0.04128, over 21472.00 frames. ], tot_loss[loss=0.2052, simple_loss=0.2786, pruned_loss=0.06588, over 4257183.45 frames. 
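During that validation pass the log also prints an `attn_weights_entropy` tensor with eight values, presumably one entropy per attention head of the named self-attention module; low entropy means sharply peaked attention, high entropy means nearly uniform attention over the keys. The sketch below shows one way such per-head entropies could be computed, as an interpretation aid only, not the zipformer.py code.

```python
# Sketch: average entropy of the attention distributions, one value per head.
# Shapes and the helper name are assumptions for illustration.
import torch

def attention_entropy(attn: torch.Tensor) -> torch.Tensor:
    # attn: (num_heads, query_len, key_len); each row is a softmax distribution
    eps = 1e-20
    ent = -(attn * (attn + eps).log()).sum(dim=-1)  # entropy per (head, query)
    return ent.mean(dim=-1)                         # average over queries -> per head

attn = torch.softmax(torch.randn(8, 50, 50), dim=-1)
print(attention_entropy(attn))  # 8 values, comparable in kind to the logged tensor
```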
], batch size: 212, lr: 3.25e-03, grad_scale: 16.0 2023-06-26 14:56:32,671 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1593342.0, ans=0.125 2023-06-26 14:56:46,836 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1593342.0, ans=0.0 2023-06-26 14:56:53,154 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.217e+02 4.926e+02 7.373e+02 9.794e+02 2.336e+03, threshold=1.475e+03, percent-clipped=12.0 2023-06-26 14:57:10,985 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.87 vs. limit=15.0 2023-06-26 14:57:43,590 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1593522.0, ans=0.125 2023-06-26 14:57:46,831 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1593522.0, ans=0.1 2023-06-26 14:57:48,944 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1593522.0, ans=0.125 2023-06-26 14:58:05,434 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1593582.0, ans=0.0 2023-06-26 14:58:15,068 INFO [train.py:996] (3/4) Epoch 9, batch 21650, loss[loss=0.1955, simple_loss=0.2956, pruned_loss=0.04765, over 21582.00 frames. ], tot_loss[loss=0.2045, simple_loss=0.2817, pruned_loss=0.0636, over 4256993.14 frames. ], batch size: 230, lr: 3.25e-03, grad_scale: 16.0 2023-06-26 14:58:30,613 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=1593642.0, ans=0.07 2023-06-26 14:59:17,338 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1593762.0, ans=0.125 2023-06-26 14:59:41,181 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1593822.0, ans=0.125 2023-06-26 14:59:48,075 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1593882.0, ans=0.0 2023-06-26 14:59:55,440 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.12 vs. limit=22.5 2023-06-26 15:00:01,564 INFO [train.py:996] (3/4) Epoch 9, batch 21700, loss[loss=0.2421, simple_loss=0.2891, pruned_loss=0.09752, over 21302.00 frames. ], tot_loss[loss=0.2028, simple_loss=0.2816, pruned_loss=0.06195, over 4257175.40 frames. 
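Every entry in this file carries the `(3/4)` prefix, i.e. this is the copy of the log written by rank 3 of a 4-process DDP run; each rank emits its own, largely identical, stream of messages. A minimal sketch of rank-prefixed logging under torch.distributed is below, illustrative only and not the actual logging setup of the recipe.

```python
# Sketch: prefix log records with "(rank/world_size)" in a distributed run.
import logging
import torch.distributed as dist

def setup_rank_logging() -> logging.Logger:
    rank = dist.get_rank() if dist.is_initialized() else 0
    world = dist.get_world_size() if dist.is_initialized() else 1
    fmt = (f"%(asctime)s %(levelname)s [%(filename)s:%(lineno)d] "
           f"({rank}/{world}) %(message)s")
    logging.basicConfig(level=logging.INFO, format=fmt)
    return logging.getLogger()

logger = setup_rank_logging()
logger.info("hello from this rank")
# prints something like: 2023-06-26 ... INFO [<file>:<line>] (0/1) hello from this rank
```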
], batch size: 507, lr: 3.25e-03, grad_scale: 16.0 2023-06-26 15:00:17,922 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=1593942.0, ans=0.125 2023-06-26 15:00:22,136 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.421e+02 4.737e+02 7.563e+02 1.159e+03 3.422e+03, threshold=1.513e+03, percent-clipped=14.0 2023-06-26 15:00:25,901 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=1594002.0, ans=0.125 2023-06-26 15:00:43,027 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1594002.0, ans=0.125 2023-06-26 15:00:43,034 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=1594002.0, ans=0.0 2023-06-26 15:01:15,259 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=10.46 vs. limit=15.0 2023-06-26 15:01:47,623 INFO [train.py:996] (3/4) Epoch 9, batch 21750, loss[loss=0.1864, simple_loss=0.2443, pruned_loss=0.06432, over 21208.00 frames. ], tot_loss[loss=0.2, simple_loss=0.277, pruned_loss=0.06149, over 4250290.43 frames. ], batch size: 548, lr: 3.24e-03, grad_scale: 16.0 2023-06-26 15:02:14,256 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=1594302.0, ans=0.0 2023-06-26 15:03:35,701 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1594482.0, ans=0.1 2023-06-26 15:03:38,353 INFO [train.py:996] (3/4) Epoch 9, batch 21800, loss[loss=0.2302, simple_loss=0.2748, pruned_loss=0.09279, over 21401.00 frames. ], tot_loss[loss=0.201, simple_loss=0.2765, pruned_loss=0.06277, over 4251668.99 frames. ], batch size: 509, lr: 3.24e-03, grad_scale: 16.0 2023-06-26 15:03:54,094 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1594542.0, ans=0.1 2023-06-26 15:04:04,207 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.457e+02 4.843e+02 6.619e+02 9.442e+02 2.103e+03, threshold=1.324e+03, percent-clipped=2.0 2023-06-26 15:04:16,703 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=1594602.0, ans=0.0 2023-06-26 15:05:09,447 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.92 vs. limit=15.0 2023-06-26 15:05:09,467 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=8.29 vs. limit=15.0 2023-06-26 15:05:15,847 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1594782.0, ans=0.125 2023-06-26 15:05:25,816 INFO [train.py:996] (3/4) Epoch 9, batch 21850, loss[loss=0.2522, simple_loss=0.3172, pruned_loss=0.09358, over 21692.00 frames. ], tot_loss[loss=0.2043, simple_loss=0.2813, pruned_loss=0.06365, over 4255176.80 frames. ], batch size: 507, lr: 3.24e-03, grad_scale: 16.0 2023-06-26 15:05:42,624 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=3.32 vs. 
limit=12.0 2023-06-26 15:06:00,931 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=1594902.0, ans=0.07 2023-06-26 15:06:25,060 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.70 vs. limit=6.0 2023-06-26 15:06:48,074 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.61 vs. limit=15.0 2023-06-26 15:07:12,456 INFO [train.py:996] (3/4) Epoch 9, batch 21900, loss[loss=0.2351, simple_loss=0.3034, pruned_loss=0.08343, over 21692.00 frames. ], tot_loss[loss=0.2048, simple_loss=0.2812, pruned_loss=0.06423, over 4262381.18 frames. ], batch size: 389, lr: 3.24e-03, grad_scale: 16.0 2023-06-26 15:07:38,241 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.466e+02 4.571e+02 6.004e+02 8.081e+02 1.811e+03, threshold=1.201e+03, percent-clipped=9.0 2023-06-26 15:08:11,459 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=1595262.0, ans=0.125 2023-06-26 15:08:35,533 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.min_abs, batch_count=1595322.0, ans=0.5 2023-06-26 15:09:04,622 INFO [train.py:996] (3/4) Epoch 9, batch 21950, loss[loss=0.2745, simple_loss=0.373, pruned_loss=0.08802, over 19733.00 frames. ], tot_loss[loss=0.2028, simple_loss=0.2771, pruned_loss=0.06421, over 4246572.45 frames. ], batch size: 702, lr: 3.24e-03, grad_scale: 16.0 2023-06-26 15:10:14,310 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.55 vs. limit=12.0 2023-06-26 15:10:54,227 INFO [train.py:996] (3/4) Epoch 9, batch 22000, loss[loss=0.1993, simple_loss=0.2755, pruned_loss=0.06153, over 21875.00 frames. ], tot_loss[loss=0.1976, simple_loss=0.2715, pruned_loss=0.06182, over 4255438.28 frames. ], batch size: 107, lr: 3.24e-03, grad_scale: 32.0 2023-06-26 15:11:15,753 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.199e+02 4.453e+02 7.165e+02 9.999e+02 1.931e+03, threshold=1.433e+03, percent-clipped=13.0 2023-06-26 15:11:27,963 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.34 vs. limit=15.0 2023-06-26 15:12:07,613 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1595922.0, ans=0.125 2023-06-26 15:12:49,930 INFO [train.py:996] (3/4) Epoch 9, batch 22050, loss[loss=0.1953, simple_loss=0.2744, pruned_loss=0.05814, over 16348.00 frames. ], tot_loss[loss=0.2017, simple_loss=0.2756, pruned_loss=0.06386, over 4253550.30 frames. 
], batch size: 61, lr: 3.24e-03, grad_scale: 16.0 2023-06-26 15:13:35,917 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1596162.0, ans=0.1 2023-06-26 15:13:37,289 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=1596162.0, ans=0.125 2023-06-26 15:13:47,551 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1596162.0, ans=0.125 2023-06-26 15:14:38,916 INFO [train.py:996] (3/4) Epoch 9, batch 22100, loss[loss=0.2085, simple_loss=0.2791, pruned_loss=0.06894, over 21813.00 frames. ], tot_loss[loss=0.2119, simple_loss=0.2871, pruned_loss=0.06833, over 4246360.78 frames. ], batch size: 247, lr: 3.24e-03, grad_scale: 16.0 2023-06-26 15:14:41,993 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.06 vs. limit=22.5 2023-06-26 15:14:44,802 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1596342.0, ans=0.0 2023-06-26 15:14:46,579 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1596342.0, ans=0.125 2023-06-26 15:14:56,631 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.930e+02 6.282e+02 9.612e+02 1.455e+03 3.538e+03, threshold=1.922e+03, percent-clipped=29.0 2023-06-26 15:15:11,954 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=1596402.0, ans=0.2 2023-06-26 15:15:16,845 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-26 15:15:37,731 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1596462.0, ans=0.125 2023-06-26 15:16:26,376 INFO [train.py:996] (3/4) Epoch 9, batch 22150, loss[loss=0.2316, simple_loss=0.3022, pruned_loss=0.08053, over 21902.00 frames. ], tot_loss[loss=0.2145, simple_loss=0.2895, pruned_loss=0.06979, over 4260763.99 frames. ], batch size: 351, lr: 3.24e-03, grad_scale: 16.0 2023-06-26 15:16:28,597 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=1596642.0, ans=0.125 2023-06-26 15:16:32,017 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=1596642.0, ans=0.0 2023-06-26 15:17:04,101 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=1596702.0, ans=0.0 2023-06-26 15:17:40,564 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1596822.0, ans=0.2 2023-06-26 15:18:05,966 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.67 vs. limit=15.0 2023-06-26 15:18:14,954 INFO [train.py:996] (3/4) Epoch 9, batch 22200, loss[loss=0.1965, simple_loss=0.2889, pruned_loss=0.05207, over 21633.00 frames. ], tot_loss[loss=0.2165, simple_loss=0.2909, pruned_loss=0.07107, over 4276482.57 frames. 
], batch size: 230, lr: 3.24e-03, grad_scale: 16.0 2023-06-26 15:18:24,275 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1596942.0, ans=0.1 2023-06-26 15:18:32,781 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.802e+02 5.060e+02 7.082e+02 1.053e+03 2.242e+03, threshold=1.416e+03, percent-clipped=3.0 2023-06-26 15:18:36,659 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=1597002.0, ans=0.2 2023-06-26 15:19:33,252 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1597122.0, ans=0.0 2023-06-26 15:20:04,045 INFO [train.py:996] (3/4) Epoch 9, batch 22250, loss[loss=0.2571, simple_loss=0.3356, pruned_loss=0.08925, over 21370.00 frames. ], tot_loss[loss=0.2207, simple_loss=0.2976, pruned_loss=0.07195, over 4280257.41 frames. ], batch size: 143, lr: 3.24e-03, grad_scale: 16.0 2023-06-26 15:20:10,078 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.34 vs. limit=22.5 2023-06-26 15:20:12,980 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1597242.0, ans=0.125 2023-06-26 15:20:59,433 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1597362.0, ans=0.1 2023-06-26 15:21:01,160 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=1597362.0, ans=0.0 2023-06-26 15:21:15,766 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1597422.0, ans=0.125 2023-06-26 15:21:51,028 INFO [train.py:996] (3/4) Epoch 9, batch 22300, loss[loss=0.2263, simple_loss=0.2911, pruned_loss=0.08073, over 21311.00 frames. ], tot_loss[loss=0.2226, simple_loss=0.2987, pruned_loss=0.07322, over 4285715.03 frames. ], batch size: 159, lr: 3.24e-03, grad_scale: 16.0 2023-06-26 15:22:08,318 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.561e+02 5.384e+02 7.516e+02 1.079e+03 3.010e+03, threshold=1.503e+03, percent-clipped=16.0 2023-06-26 15:22:38,373 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1597662.0, ans=0.125 2023-06-26 15:22:38,375 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1597662.0, ans=0.125 2023-06-26 15:22:40,764 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=1597662.0, ans=0.04949747468305833 2023-06-26 15:22:50,888 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1597662.0, ans=0.0 2023-06-26 15:22:51,539 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=13.06 vs. limit=15.0 2023-06-26 15:23:25,152 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1597782.0, ans=0.0 2023-06-26 15:23:33,859 INFO [train.py:996] (3/4) Epoch 9, batch 22350, loss[loss=0.2539, simple_loss=0.3177, pruned_loss=0.09504, over 21699.00 frames. 
], tot_loss[loss=0.2224, simple_loss=0.2974, pruned_loss=0.07367, over 4286023.07 frames. ], batch size: 508, lr: 3.24e-03, grad_scale: 16.0 2023-06-26 15:23:59,129 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1597902.0, ans=0.0 2023-06-26 15:23:59,236 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=1597902.0, ans=0.04949747468305833 2023-06-26 15:24:01,802 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.59 vs. limit=22.5 2023-06-26 15:24:41,016 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=1598022.0, ans=0.0 2023-06-26 15:25:13,791 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=1598082.0, ans=0.125 2023-06-26 15:25:21,635 INFO [train.py:996] (3/4) Epoch 9, batch 22400, loss[loss=0.2073, simple_loss=0.2876, pruned_loss=0.06351, over 21669.00 frames. ], tot_loss[loss=0.217, simple_loss=0.2934, pruned_loss=0.07032, over 4289701.92 frames. ], batch size: 332, lr: 3.24e-03, grad_scale: 32.0 2023-06-26 15:25:49,400 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.723e+02 5.104e+02 6.690e+02 9.796e+02 2.008e+03, threshold=1.338e+03, percent-clipped=2.0 2023-06-26 15:26:12,139 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=1598262.0, ans=0.05 2023-06-26 15:26:40,023 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1598322.0, ans=0.125 2023-06-26 15:26:55,494 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1598382.0, ans=0.125 2023-06-26 15:26:57,687 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=15.67 vs. limit=15.0 2023-06-26 15:27:14,427 INFO [train.py:996] (3/4) Epoch 9, batch 22450, loss[loss=0.2122, simple_loss=0.2712, pruned_loss=0.0766, over 21813.00 frames. ], tot_loss[loss=0.2133, simple_loss=0.2871, pruned_loss=0.06976, over 4282470.86 frames. ], batch size: 98, lr: 3.24e-03, grad_scale: 32.0 2023-06-26 15:27:49,696 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1598502.0, ans=0.125 2023-06-26 15:28:15,795 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1598622.0, ans=0.125 2023-06-26 15:28:40,776 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1598682.0, ans=0.125 2023-06-26 15:29:02,870 INFO [train.py:996] (3/4) Epoch 9, batch 22500, loss[loss=0.2047, simple_loss=0.2899, pruned_loss=0.05973, over 21320.00 frames. ], tot_loss[loss=0.2112, simple_loss=0.2835, pruned_loss=0.06944, over 4285699.07 frames. 
], batch size: 194, lr: 3.24e-03, grad_scale: 16.0 2023-06-26 15:29:17,203 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1598742.0, ans=0.1 2023-06-26 15:29:17,272 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=1598742.0, ans=0.04949747468305833 2023-06-26 15:29:26,966 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.381e+02 5.166e+02 7.858e+02 1.138e+03 3.264e+03, threshold=1.572e+03, percent-clipped=12.0 2023-06-26 15:29:36,203 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=1598802.0, ans=0.0 2023-06-26 15:30:00,842 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1598862.0, ans=0.0 2023-06-26 15:30:11,577 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1598922.0, ans=0.125 2023-06-26 15:30:18,531 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1598922.0, ans=0.0 2023-06-26 15:30:20,429 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=1598922.0, ans=0.125 2023-06-26 15:30:28,285 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1598982.0, ans=0.0 2023-06-26 15:30:57,515 INFO [train.py:996] (3/4) Epoch 9, batch 22550, loss[loss=0.2445, simple_loss=0.3278, pruned_loss=0.08064, over 21785.00 frames. ], tot_loss[loss=0.2148, simple_loss=0.2883, pruned_loss=0.0706, over 4285330.81 frames. ], batch size: 414, lr: 3.24e-03, grad_scale: 16.0 2023-06-26 15:31:26,324 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1599102.0, ans=0.125 2023-06-26 15:31:55,079 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=1599162.0, ans=0.2 2023-06-26 15:32:09,405 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1599222.0, ans=0.1 2023-06-26 15:32:49,141 INFO [train.py:996] (3/4) Epoch 9, batch 22600, loss[loss=0.2281, simple_loss=0.3108, pruned_loss=0.07269, over 21737.00 frames. ], tot_loss[loss=0.2152, simple_loss=0.2901, pruned_loss=0.07019, over 4290798.87 frames. ], batch size: 298, lr: 3.24e-03, grad_scale: 16.0 2023-06-26 15:33:08,789 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.130e+02 6.886e+02 1.082e+03 1.570e+03 3.521e+03, threshold=2.164e+03, percent-clipped=24.0 2023-06-26 15:33:38,808 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1599462.0, ans=0.125 2023-06-26 15:34:17,050 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1599522.0, ans=0.0 2023-06-26 15:34:26,054 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1599582.0, ans=0.0 2023-06-26 15:34:37,815 INFO [train.py:996] (3/4) Epoch 9, batch 22650, loss[loss=0.2092, simple_loss=0.2796, pruned_loss=0.0694, over 21741.00 frames. 
], tot_loss[loss=0.2139, simple_loss=0.2874, pruned_loss=0.07015, over 4262074.92 frames. ], batch size: 112, lr: 3.24e-03, grad_scale: 16.0 2023-06-26 15:35:58,085 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=1599822.0, ans=0.0 2023-06-26 15:36:13,077 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=1599882.0, ans=0.125 2023-06-26 15:36:22,314 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=1599882.0, ans=0.125 2023-06-26 15:36:24,630 INFO [scaling.py:962] (3/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=3.69 vs. limit=5.0 2023-06-26 15:36:24,831 INFO [train.py:996] (3/4) Epoch 9, batch 22700, loss[loss=0.2022, simple_loss=0.2598, pruned_loss=0.07233, over 21628.00 frames. ], tot_loss[loss=0.2101, simple_loss=0.2809, pruned_loss=0.06963, over 4263555.11 frames. ], batch size: 298, lr: 3.24e-03, grad_scale: 16.0 2023-06-26 15:36:36,589 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1599942.0, ans=0.1 2023-06-26 15:36:44,327 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.472e+02 5.506e+02 7.412e+02 1.059e+03 2.032e+03, threshold=1.482e+03, percent-clipped=0.0 2023-06-26 15:36:59,252 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1600002.0, ans=0.0 2023-06-26 15:37:14,685 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.91 vs. limit=22.5 2023-06-26 15:37:20,828 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=1600062.0, ans=0.125 2023-06-26 15:37:27,025 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.64 vs. limit=10.0 2023-06-26 15:37:50,731 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1600122.0, ans=0.1 2023-06-26 15:37:52,315 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=1600122.0, ans=0.2 2023-06-26 15:38:05,974 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1600182.0, ans=0.1 2023-06-26 15:38:11,795 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.58 vs. limit=15.0 2023-06-26 15:38:13,886 INFO [train.py:996] (3/4) Epoch 9, batch 22750, loss[loss=0.2386, simple_loss=0.3065, pruned_loss=0.08536, over 21758.00 frames. ], tot_loss[loss=0.2121, simple_loss=0.2824, pruned_loss=0.07095, over 4264844.61 frames. 
], batch size: 332, lr: 3.24e-03, grad_scale: 16.0 2023-06-26 15:38:21,952 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=1600242.0, ans=0.125 2023-06-26 15:38:48,372 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1600302.0, ans=0.1 2023-06-26 15:38:50,229 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1600302.0, ans=0.125 2023-06-26 15:39:08,845 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1600362.0, ans=0.125 2023-06-26 15:39:22,976 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1600422.0, ans=0.1 2023-06-26 15:40:01,400 INFO [train.py:996] (3/4) Epoch 9, batch 22800, loss[loss=0.2264, simple_loss=0.293, pruned_loss=0.07991, over 21877.00 frames. ], tot_loss[loss=0.2169, simple_loss=0.2881, pruned_loss=0.07283, over 4272782.10 frames. ], batch size: 351, lr: 3.24e-03, grad_scale: 32.0 2023-06-26 15:40:06,285 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1600542.0, ans=0.125 2023-06-26 15:40:06,317 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=1600542.0, ans=0.0 2023-06-26 15:40:16,676 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.47 vs. limit=22.5 2023-06-26 15:40:23,227 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=1600602.0, ans=0.0 2023-06-26 15:40:28,035 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.571e+02 5.339e+02 7.756e+02 1.140e+03 2.355e+03, threshold=1.551e+03, percent-clipped=14.0 2023-06-26 15:40:54,378 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1600662.0, ans=0.125 2023-06-26 15:41:05,651 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.27 vs. limit=10.0 2023-06-26 15:41:49,509 INFO [train.py:996] (3/4) Epoch 9, batch 22850, loss[loss=0.1951, simple_loss=0.2618, pruned_loss=0.0642, over 21501.00 frames. ], tot_loss[loss=0.2147, simple_loss=0.2848, pruned_loss=0.07231, over 4274050.93 frames. ], batch size: 230, lr: 3.24e-03, grad_scale: 16.0 2023-06-26 15:41:56,587 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=1600842.0, ans=0.125 2023-06-26 15:42:28,934 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=12.67 vs. 
limit=15.0 2023-06-26 15:42:32,923 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=1600962.0, ans=0.5 2023-06-26 15:42:52,356 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1601022.0, ans=0.125 2023-06-26 15:43:20,767 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1601082.0, ans=0.0 2023-06-26 15:43:37,594 INFO [train.py:996] (3/4) Epoch 9, batch 22900, loss[loss=0.2274, simple_loss=0.3027, pruned_loss=0.07607, over 21248.00 frames. ], tot_loss[loss=0.2145, simple_loss=0.2855, pruned_loss=0.07176, over 4258309.33 frames. ], batch size: 548, lr: 3.24e-03, grad_scale: 16.0 2023-06-26 15:44:04,403 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.622e+02 6.448e+02 8.997e+02 1.321e+03 2.993e+03, threshold=1.799e+03, percent-clipped=19.0 2023-06-26 15:44:05,465 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1601202.0, ans=0.125 2023-06-26 15:44:12,332 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=1601202.0, ans=0.125 2023-06-26 15:44:19,577 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.14 vs. limit=10.0 2023-06-26 15:45:08,674 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.44 vs. limit=10.0 2023-06-26 15:45:28,327 INFO [train.py:996] (3/4) Epoch 9, batch 22950, loss[loss=0.2002, simple_loss=0.2515, pruned_loss=0.07449, over 20339.00 frames. ], tot_loss[loss=0.2183, simple_loss=0.2971, pruned_loss=0.0698, over 4262018.02 frames. ], batch size: 703, lr: 3.24e-03, grad_scale: 16.0 2023-06-26 15:45:32,310 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1601442.0, ans=0.125 2023-06-26 15:46:23,135 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1601562.0, ans=0.1 2023-06-26 15:46:23,230 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=1601562.0, ans=0.2 2023-06-26 15:46:33,568 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1601562.0, ans=0.0 2023-06-26 15:46:36,493 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1601622.0, ans=0.0 2023-06-26 15:47:10,771 INFO [train.py:996] (3/4) Epoch 9, batch 23000, loss[loss=0.2052, simple_loss=0.2811, pruned_loss=0.06468, over 21850.00 frames. ], tot_loss[loss=0.2173, simple_loss=0.2983, pruned_loss=0.06813, over 4272076.20 frames. 
], batch size: 298, lr: 3.24e-03, grad_scale: 16.0 2023-06-26 15:47:42,625 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.591e+02 4.526e+02 6.178e+02 9.113e+02 2.510e+03, threshold=1.236e+03, percent-clipped=4.0 2023-06-26 15:48:02,103 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=1601862.0, ans=0.015 2023-06-26 15:48:22,733 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1601862.0, ans=0.0 2023-06-26 15:48:23,445 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=7.42 vs. limit=15.0 2023-06-26 15:48:32,325 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.34 vs. limit=15.0 2023-06-26 15:49:11,846 INFO [train.py:996] (3/4) Epoch 9, batch 23050, loss[loss=0.2777, simple_loss=0.3353, pruned_loss=0.1101, over 21474.00 frames. ], tot_loss[loss=0.2194, simple_loss=0.2991, pruned_loss=0.06985, over 4278153.46 frames. ], batch size: 471, lr: 3.24e-03, grad_scale: 16.0 2023-06-26 15:49:15,721 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1602042.0, ans=0.0 2023-06-26 15:49:38,403 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1602102.0, ans=0.0 2023-06-26 15:49:40,686 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=5.06 vs. limit=12.0 2023-06-26 15:49:57,868 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=1602162.0, ans=0.0 2023-06-26 15:50:17,271 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1602222.0, ans=0.0 2023-06-26 15:50:54,757 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1602342.0, ans=0.0 2023-06-26 15:50:55,765 INFO [train.py:996] (3/4) Epoch 9, batch 23100, loss[loss=0.1894, simple_loss=0.2465, pruned_loss=0.06616, over 20721.00 frames. ], tot_loss[loss=0.2181, simple_loss=0.2952, pruned_loss=0.0705, over 4275231.08 frames. ], batch size: 608, lr: 3.24e-03, grad_scale: 16.0 2023-06-26 15:51:01,671 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=1602342.0, ans=0.0 2023-06-26 15:51:22,037 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.578e+02 4.836e+02 6.120e+02 9.547e+02 2.287e+03, threshold=1.224e+03, percent-clipped=14.0 2023-06-26 15:52:44,510 INFO [train.py:996] (3/4) Epoch 9, batch 23150, loss[loss=0.2075, simple_loss=0.2685, pruned_loss=0.07326, over 21340.00 frames. ], tot_loss[loss=0.2135, simple_loss=0.2888, pruned_loss=0.06913, over 4276431.52 frames. 
], batch size: 159, lr: 3.24e-03, grad_scale: 16.0 2023-06-26 15:53:26,390 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1602702.0, ans=0.125 2023-06-26 15:54:07,224 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=1602882.0, ans=0.125 2023-06-26 15:54:25,659 INFO [train.py:996] (3/4) Epoch 9, batch 23200, loss[loss=0.216, simple_loss=0.2869, pruned_loss=0.07257, over 20148.00 frames. ], tot_loss[loss=0.2149, simple_loss=0.2891, pruned_loss=0.07036, over 4285328.99 frames. ], batch size: 703, lr: 3.24e-03, grad_scale: 32.0 2023-06-26 15:54:28,034 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1602942.0, ans=0.125 2023-06-26 15:54:57,767 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.454e+02 5.096e+02 6.731e+02 1.055e+03 2.311e+03, threshold=1.346e+03, percent-clipped=14.0 2023-06-26 15:55:05,018 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1603002.0, ans=0.125 2023-06-26 15:55:48,253 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1603122.0, ans=0.125 2023-06-26 15:55:52,005 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=1603182.0, ans=0.0 2023-06-26 15:56:14,184 INFO [train.py:996] (3/4) Epoch 9, batch 23250, loss[loss=0.2423, simple_loss=0.3031, pruned_loss=0.09073, over 21625.00 frames. ], tot_loss[loss=0.2156, simple_loss=0.2883, pruned_loss=0.07148, over 4293956.75 frames. ], batch size: 471, lr: 3.24e-03, grad_scale: 16.0 2023-06-26 15:57:05,095 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1603362.0, ans=0.125 2023-06-26 15:57:07,297 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.52 vs. limit=22.5 2023-06-26 15:57:34,145 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.93 vs. limit=6.0 2023-06-26 15:58:08,881 INFO [train.py:996] (3/4) Epoch 9, batch 23300, loss[loss=0.2383, simple_loss=0.3425, pruned_loss=0.06706, over 21428.00 frames. ], tot_loss[loss=0.2204, simple_loss=0.2954, pruned_loss=0.07268, over 4299790.77 frames. 
], batch size: 211, lr: 3.24e-03, grad_scale: 16.0 2023-06-26 15:58:37,845 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.694e+02 6.000e+02 9.033e+02 1.405e+03 3.617e+03, threshold=1.807e+03, percent-clipped=26.0 2023-06-26 15:58:47,119 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=1603602.0, ans=0.035 2023-06-26 15:58:57,709 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1603662.0, ans=0.125 2023-06-26 15:59:00,155 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-26 15:59:10,323 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1603662.0, ans=0.125 2023-06-26 16:00:01,893 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.56 vs. limit=15.0 2023-06-26 16:00:05,620 INFO [train.py:996] (3/4) Epoch 9, batch 23350, loss[loss=0.1535, simple_loss=0.2318, pruned_loss=0.03759, over 21196.00 frames. ], tot_loss[loss=0.2228, simple_loss=0.3011, pruned_loss=0.07226, over 4299500.95 frames. ], batch size: 159, lr: 3.24e-03, grad_scale: 16.0 2023-06-26 16:00:16,723 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1603842.0, ans=0.0 2023-06-26 16:00:30,372 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.max_abs, batch_count=1603902.0, ans=10.0 2023-06-26 16:00:42,949 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1603902.0, ans=0.0 2023-06-26 16:01:01,969 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1604022.0, ans=0.1 2023-06-26 16:01:26,079 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1604022.0, ans=0.125 2023-06-26 16:01:32,740 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1604082.0, ans=0.0 2023-06-26 16:01:47,885 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.25 vs. limit=15.0 2023-06-26 16:01:50,611 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1604082.0, ans=0.0 2023-06-26 16:01:52,577 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1604142.0, ans=0.125 2023-06-26 16:01:53,509 INFO [train.py:996] (3/4) Epoch 9, batch 23400, loss[loss=0.1856, simple_loss=0.2583, pruned_loss=0.05643, over 21440.00 frames. ], tot_loss[loss=0.2156, simple_loss=0.2941, pruned_loss=0.0685, over 4299470.23 frames. 
], batch size: 211, lr: 3.23e-03, grad_scale: 16.0 2023-06-26 16:02:15,256 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=1604202.0, ans=0.0 2023-06-26 16:02:21,753 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.225e+02 5.479e+02 7.119e+02 1.024e+03 2.077e+03, threshold=1.424e+03, percent-clipped=2.0 2023-06-26 16:02:31,631 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1604202.0, ans=0.1 2023-06-26 16:03:13,790 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.12 vs. limit=12.0 2023-06-26 16:03:14,971 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1604322.0, ans=0.125 2023-06-26 16:03:32,442 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=1604382.0, ans=0.0 2023-06-26 16:03:47,233 INFO [train.py:996] (3/4) Epoch 9, batch 23450, loss[loss=0.2824, simple_loss=0.3345, pruned_loss=0.1151, over 21303.00 frames. ], tot_loss[loss=0.218, simple_loss=0.2944, pruned_loss=0.07082, over 4300528.51 frames. ], batch size: 507, lr: 3.23e-03, grad_scale: 16.0 2023-06-26 16:04:13,637 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=1604502.0, ans=0.04949747468305833 2023-06-26 16:05:28,872 INFO [train.py:996] (3/4) Epoch 9, batch 23500, loss[loss=0.2205, simple_loss=0.2781, pruned_loss=0.08148, over 21621.00 frames. ], tot_loss[loss=0.2198, simple_loss=0.2951, pruned_loss=0.07223, over 4293120.87 frames. ], batch size: 548, lr: 3.23e-03, grad_scale: 16.0 2023-06-26 16:05:55,207 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1604802.0, ans=0.125 2023-06-26 16:05:56,210 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.762e+02 6.548e+02 9.049e+02 1.310e+03 3.325e+03, threshold=1.810e+03, percent-clipped=21.0 2023-06-26 16:06:37,043 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.58 vs. limit=15.0 2023-06-26 16:06:55,826 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=1604922.0, ans=0.125 2023-06-26 16:07:15,821 INFO [train.py:996] (3/4) Epoch 9, batch 23550, loss[loss=0.2107, simple_loss=0.2773, pruned_loss=0.07201, over 21302.00 frames. ], tot_loss[loss=0.2162, simple_loss=0.2895, pruned_loss=0.07148, over 4290257.90 frames. 
], batch size: 131, lr: 3.23e-03, grad_scale: 16.0 2023-06-26 16:07:28,595 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=1605042.0, ans=0.125 2023-06-26 16:07:38,375 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=1605102.0, ans=0.125 2023-06-26 16:07:47,127 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1605102.0, ans=0.0 2023-06-26 16:07:47,233 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=1605102.0, ans=0.125 2023-06-26 16:08:21,757 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten.whitening_limit, batch_count=1605162.0, ans=15.0 2023-06-26 16:09:03,826 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=12.01 vs. limit=15.0 2023-06-26 16:09:04,142 INFO [train.py:996] (3/4) Epoch 9, batch 23600, loss[loss=0.2248, simple_loss=0.303, pruned_loss=0.0733, over 21987.00 frames. ], tot_loss[loss=0.2171, simple_loss=0.2907, pruned_loss=0.07173, over 4288438.45 frames. ], batch size: 317, lr: 3.23e-03, grad_scale: 32.0 2023-06-26 16:09:15,656 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1605342.0, ans=0.0 2023-06-26 16:09:20,963 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1605342.0, ans=0.125 2023-06-26 16:09:24,644 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1605342.0, ans=0.0 2023-06-26 16:09:32,651 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.846e+02 5.348e+02 7.374e+02 1.134e+03 2.536e+03, threshold=1.475e+03, percent-clipped=3.0 2023-06-26 16:09:55,029 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=1605462.0, ans=0.0 2023-06-26 16:10:24,327 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.44 vs. limit=15.0 2023-06-26 16:10:38,042 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=1605582.0, ans=0.035 2023-06-26 16:10:38,101 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1605582.0, ans=0.0 2023-06-26 16:10:55,356 INFO [train.py:996] (3/4) Epoch 9, batch 23650, loss[loss=0.2009, simple_loss=0.2818, pruned_loss=0.06003, over 21728.00 frames. ], tot_loss[loss=0.2158, simple_loss=0.2914, pruned_loss=0.07011, over 4295270.98 frames. ], batch size: 247, lr: 3.23e-03, grad_scale: 16.0 2023-06-26 16:11:08,192 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=1605642.0, ans=0.07 2023-06-26 16:11:08,199 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=1605642.0, ans=0.0 2023-06-26 16:11:15,518 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.47 vs. 
limit=22.5 2023-06-26 16:11:25,367 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1605702.0, ans=0.0 2023-06-26 16:12:21,206 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1605822.0, ans=0.125 2023-06-26 16:12:21,877 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten.whitening_limit, batch_count=1605822.0, ans=15.0 2023-06-26 16:12:23,081 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer_ff2.min_abs, batch_count=1605822.0, ans=0.1 2023-06-26 16:12:43,719 INFO [train.py:996] (3/4) Epoch 9, batch 23700, loss[loss=0.2631, simple_loss=0.3319, pruned_loss=0.09719, over 21396.00 frames. ], tot_loss[loss=0.2172, simple_loss=0.2943, pruned_loss=0.07004, over 4295644.05 frames. ], batch size: 507, lr: 3.23e-03, grad_scale: 16.0 2023-06-26 16:13:18,791 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.291e+02 4.703e+02 6.208e+02 8.925e+02 2.253e+03, threshold=1.242e+03, percent-clipped=5.0 2023-06-26 16:13:42,656 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1606062.0, ans=0.0 2023-06-26 16:14:13,374 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-26 16:14:33,483 INFO [train.py:996] (3/4) Epoch 9, batch 23750, loss[loss=0.2169, simple_loss=0.3171, pruned_loss=0.05832, over 21668.00 frames. ], tot_loss[loss=0.2201, simple_loss=0.2981, pruned_loss=0.07105, over 4291895.80 frames. ], batch size: 441, lr: 3.23e-03, grad_scale: 16.0 2023-06-26 16:15:18,201 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=1606302.0, ans=0.2 2023-06-26 16:15:33,637 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=1606362.0, ans=0.0 2023-06-26 16:16:27,405 INFO [train.py:996] (3/4) Epoch 9, batch 23800, loss[loss=0.2498, simple_loss=0.3473, pruned_loss=0.07617, over 21626.00 frames. ], tot_loss[loss=0.2162, simple_loss=0.2953, pruned_loss=0.06858, over 4284887.79 frames. ], batch size: 389, lr: 3.23e-03, grad_scale: 16.0 2023-06-26 16:16:53,603 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1606542.0, ans=0.125 2023-06-26 16:17:04,098 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.330e+02 5.235e+02 7.849e+02 1.092e+03 2.188e+03, threshold=1.570e+03, percent-clipped=19.0 2023-06-26 16:17:06,384 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=1606602.0, ans=0.0 2023-06-26 16:17:12,225 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1606602.0, ans=0.0 2023-06-26 16:17:23,790 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1606662.0, ans=0.0 2023-06-26 16:17:53,978 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1606722.0, ans=0.125 2023-06-26 16:18:29,062 INFO [train.py:996] (3/4) Epoch 9, batch 23850, loss[loss=0.231, simple_loss=0.3055, pruned_loss=0.07828, over 21491.00 frames. 
], tot_loss[loss=0.2227, simple_loss=0.3041, pruned_loss=0.0707, over 4277705.81 frames. ], batch size: 211, lr: 3.23e-03, grad_scale: 16.0 2023-06-26 16:19:13,087 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1606962.0, ans=0.0 2023-06-26 16:19:13,107 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1606962.0, ans=0.125 2023-06-26 16:19:23,857 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1606962.0, ans=0.1 2023-06-26 16:19:28,825 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=1607022.0, ans=0.05 2023-06-26 16:19:49,887 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=1607022.0, ans=0.0 2023-06-26 16:20:16,533 INFO [train.py:996] (3/4) Epoch 9, batch 23900, loss[loss=0.2111, simple_loss=0.2928, pruned_loss=0.06465, over 21773.00 frames. ], tot_loss[loss=0.2281, simple_loss=0.3105, pruned_loss=0.0729, over 4278646.06 frames. ], batch size: 124, lr: 3.23e-03, grad_scale: 16.0 2023-06-26 16:20:45,480 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.978e+02 7.206e+02 9.928e+02 1.468e+03 4.059e+03, threshold=1.986e+03, percent-clipped=20.0 2023-06-26 16:20:54,100 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1607262.0, ans=0.1 2023-06-26 16:21:03,284 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.25 vs. limit=15.0 2023-06-26 16:21:35,216 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=1607322.0, ans=0.2 2023-06-26 16:22:02,566 INFO [train.py:996] (3/4) Epoch 9, batch 23950, loss[loss=0.2359, simple_loss=0.3103, pruned_loss=0.0808, over 21449.00 frames. ], tot_loss[loss=0.2244, simple_loss=0.3041, pruned_loss=0.07238, over 4273586.32 frames. ], batch size: 131, lr: 3.23e-03, grad_scale: 16.0 2023-06-26 16:22:14,480 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.66 vs. limit=15.0 2023-06-26 16:22:24,684 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.69 vs. limit=12.0 2023-06-26 16:23:35,818 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1607682.0, ans=0.125 2023-06-26 16:23:50,728 INFO [train.py:996] (3/4) Epoch 9, batch 24000, loss[loss=0.3026, simple_loss=0.3555, pruned_loss=0.1249, over 21437.00 frames. ], tot_loss[loss=0.2271, simple_loss=0.3044, pruned_loss=0.07491, over 4263807.00 frames. ], batch size: 510, lr: 3.23e-03, grad_scale: 32.0 2023-06-26 16:23:50,728 INFO [train.py:1019] (3/4) Computing validation loss 2023-06-26 16:24:10,702 INFO [train.py:1028] (3/4) Epoch 9, validation: loss=0.2632, simple_loss=0.3589, pruned_loss=0.0837, over 1796401.00 frames. 
2023-06-26 16:24:10,703 INFO [train.py:1029] (3/4) Maximum memory allocated so far is 23690MB 2023-06-26 16:24:36,337 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.989e+02 5.715e+02 7.802e+02 1.213e+03 2.324e+03, threshold=1.560e+03, percent-clipped=4.0 2023-06-26 16:26:00,887 INFO [train.py:996] (3/4) Epoch 9, batch 24050, loss[loss=0.2265, simple_loss=0.3133, pruned_loss=0.06984, over 21605.00 frames. ], tot_loss[loss=0.229, simple_loss=0.3062, pruned_loss=0.07591, over 4271752.85 frames. ], batch size: 414, lr: 3.23e-03, grad_scale: 32.0 2023-06-26 16:26:16,454 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.25 vs. limit=15.0 2023-06-26 16:26:44,871 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1608162.0, ans=0.1 2023-06-26 16:26:46,696 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1608162.0, ans=0.1 2023-06-26 16:27:15,439 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1608222.0, ans=0.125 2023-06-26 16:27:39,919 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=1608282.0, ans=0.0 2023-06-26 16:27:49,887 INFO [train.py:996] (3/4) Epoch 9, batch 24100, loss[loss=0.2308, simple_loss=0.3345, pruned_loss=0.06358, over 20785.00 frames. ], tot_loss[loss=0.2284, simple_loss=0.3071, pruned_loss=0.07482, over 4277097.25 frames. ], batch size: 607, lr: 3.23e-03, grad_scale: 16.0 2023-06-26 16:27:56,625 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=4.64 vs. limit=15.0 2023-06-26 16:27:59,538 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=1608342.0, ans=0.025 2023-06-26 16:28:27,566 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.963e+02 5.200e+02 7.145e+02 1.046e+03 2.381e+03, threshold=1.429e+03, percent-clipped=3.0 2023-06-26 16:28:34,758 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=1608462.0, ans=0.05 2023-06-26 16:29:07,955 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1608522.0, ans=0.125 2023-06-26 16:29:21,706 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1608582.0, ans=0.0 2023-06-26 16:29:27,786 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.59 vs. limit=6.0 2023-06-26 16:29:36,199 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1608582.0, ans=0.1 2023-06-26 16:29:39,186 INFO [train.py:996] (3/4) Epoch 9, batch 24150, loss[loss=0.2643, simple_loss=0.3405, pruned_loss=0.09409, over 21855.00 frames. ], tot_loss[loss=0.2295, simple_loss=0.3067, pruned_loss=0.07621, over 4284689.89 frames. 
], batch size: 107, lr: 3.23e-03, grad_scale: 16.0 2023-06-26 16:30:15,250 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1608702.0, ans=0.1 2023-06-26 16:31:14,922 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=5.61 vs. limit=12.0 2023-06-26 16:31:21,870 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1608882.0, ans=0.1 2023-06-26 16:31:29,814 INFO [train.py:996] (3/4) Epoch 9, batch 24200, loss[loss=0.2075, simple_loss=0.2937, pruned_loss=0.0606, over 21627.00 frames. ], tot_loss[loss=0.2315, simple_loss=0.3086, pruned_loss=0.07727, over 4287772.69 frames. ], batch size: 230, lr: 3.23e-03, grad_scale: 16.0 2023-06-26 16:32:12,919 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.636e+02 5.699e+02 8.049e+02 1.259e+03 2.421e+03, threshold=1.610e+03, percent-clipped=17.0 2023-06-26 16:32:20,738 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=1609062.0, ans=0.2 2023-06-26 16:32:47,063 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1609122.0, ans=0.125 2023-06-26 16:33:31,007 INFO [train.py:996] (3/4) Epoch 9, batch 24250, loss[loss=0.17, simple_loss=0.2703, pruned_loss=0.03489, over 21672.00 frames. ], tot_loss[loss=0.2226, simple_loss=0.3043, pruned_loss=0.07051, over 4289200.70 frames. ], batch size: 263, lr: 3.23e-03, grad_scale: 16.0 2023-06-26 16:34:01,321 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=1609302.0, ans=0.0 2023-06-26 16:34:02,898 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1609302.0, ans=0.0 2023-06-26 16:35:12,351 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=1609482.0, ans=0.125 2023-06-26 16:35:18,896 INFO [train.py:996] (3/4) Epoch 9, batch 24300, loss[loss=0.1674, simple_loss=0.2469, pruned_loss=0.04396, over 21826.00 frames. ], tot_loss[loss=0.2145, simple_loss=0.2978, pruned_loss=0.06555, over 4284218.87 frames. ], batch size: 107, lr: 3.23e-03, grad_scale: 16.0 2023-06-26 16:35:22,776 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1609542.0, ans=0.1 2023-06-26 16:35:50,231 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.136e+02 4.084e+02 7.233e+02 1.324e+03 4.143e+03, threshold=1.447e+03, percent-clipped=16.0 2023-06-26 16:35:50,873 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=1609602.0, ans=0.0 2023-06-26 16:36:40,389 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.15 vs. limit=6.0 2023-06-26 16:37:07,348 INFO [train.py:996] (3/4) Epoch 9, batch 24350, loss[loss=0.2201, simple_loss=0.293, pruned_loss=0.07365, over 21019.00 frames. ], tot_loss[loss=0.2119, simple_loss=0.2935, pruned_loss=0.06517, over 4286498.48 frames. 
], batch size: 607, lr: 3.23e-03, grad_scale: 16.0 2023-06-26 16:38:20,400 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=1610022.0, ans=0.2 2023-06-26 16:39:02,416 INFO [train.py:996] (3/4) Epoch 9, batch 24400, loss[loss=0.2563, simple_loss=0.3262, pruned_loss=0.09318, over 21443.00 frames. ], tot_loss[loss=0.2183, simple_loss=0.2983, pruned_loss=0.06909, over 4288795.34 frames. ], batch size: 471, lr: 3.23e-03, grad_scale: 32.0 2023-06-26 16:39:13,517 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1610142.0, ans=0.2 2023-06-26 16:39:34,019 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.576e+02 5.148e+02 6.716e+02 1.029e+03 2.743e+03, threshold=1.343e+03, percent-clipped=7.0 2023-06-26 16:40:28,234 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=1610382.0, ans=0.05 2023-06-26 16:40:46,059 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1610382.0, ans=0.1 2023-06-26 16:40:52,860 INFO [train.py:996] (3/4) Epoch 9, batch 24450, loss[loss=0.2377, simple_loss=0.3316, pruned_loss=0.07194, over 21898.00 frames. ], tot_loss[loss=0.2217, simple_loss=0.3019, pruned_loss=0.07074, over 4286749.56 frames. ], batch size: 372, lr: 3.23e-03, grad_scale: 32.0 2023-06-26 16:41:54,549 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.47 vs. limit=12.0 2023-06-26 16:42:05,098 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1610622.0, ans=0.125 2023-06-26 16:42:41,526 INFO [train.py:996] (3/4) Epoch 9, batch 24500, loss[loss=0.2035, simple_loss=0.2872, pruned_loss=0.05996, over 21461.00 frames. ], tot_loss[loss=0.2204, simple_loss=0.3006, pruned_loss=0.07011, over 4290113.81 frames. ], batch size: 194, lr: 3.23e-03, grad_scale: 32.0 2023-06-26 16:43:08,198 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=1610802.0, ans=0.125 2023-06-26 16:43:14,694 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.443e+02 5.135e+02 6.610e+02 1.095e+03 2.710e+03, threshold=1.322e+03, percent-clipped=12.0 2023-06-26 16:43:47,481 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=1610922.0, ans=0.125 2023-06-26 16:44:04,230 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1610922.0, ans=0.125 2023-06-26 16:44:35,210 INFO [train.py:996] (3/4) Epoch 9, batch 24550, loss[loss=0.2412, simple_loss=0.3177, pruned_loss=0.0823, over 21924.00 frames. ], tot_loss[loss=0.2243, simple_loss=0.3034, pruned_loss=0.07262, over 4292674.54 frames. 
], batch size: 372, lr: 3.23e-03, grad_scale: 16.0 2023-06-26 16:44:54,659 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1611102.0, ans=0.125 2023-06-26 16:45:27,627 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.max_positive, batch_count=1611162.0, ans=0.95 2023-06-26 16:46:09,062 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1611282.0, ans=0.0 2023-06-26 16:46:16,620 INFO [train.py:996] (3/4) Epoch 9, batch 24600, loss[loss=0.2519, simple_loss=0.3108, pruned_loss=0.09644, over 21474.00 frames. ], tot_loss[loss=0.2213, simple_loss=0.2978, pruned_loss=0.07236, over 4292768.83 frames. ], batch size: 509, lr: 3.23e-03, grad_scale: 16.0 2023-06-26 16:46:48,949 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.731e+02 5.495e+02 6.731e+02 9.246e+02 1.741e+03, threshold=1.346e+03, percent-clipped=6.0 2023-06-26 16:48:05,291 INFO [train.py:996] (3/4) Epoch 9, batch 24650, loss[loss=0.1893, simple_loss=0.2534, pruned_loss=0.06262, over 15698.00 frames. ], tot_loss[loss=0.2154, simple_loss=0.289, pruned_loss=0.07091, over 4287070.85 frames. ], batch size: 64, lr: 3.23e-03, grad_scale: 16.0 2023-06-26 16:48:23,717 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.59 vs. limit=6.0 2023-06-26 16:48:35,877 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1611702.0, ans=0.125 2023-06-26 16:49:58,454 INFO [train.py:996] (3/4) Epoch 9, batch 24700, loss[loss=0.1876, simple_loss=0.262, pruned_loss=0.05655, over 21339.00 frames. ], tot_loss[loss=0.2142, simple_loss=0.2886, pruned_loss=0.06989, over 4284215.69 frames. ], batch size: 211, lr: 3.23e-03, grad_scale: 16.0 2023-06-26 16:50:10,173 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=9.38 vs. limit=10.0 2023-06-26 16:50:31,842 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.355e+02 4.942e+02 6.984e+02 9.406e+02 2.267e+03, threshold=1.397e+03, percent-clipped=8.0 2023-06-26 16:50:55,071 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=12.80 vs. limit=15.0 2023-06-26 16:50:56,237 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1612062.0, ans=0.0 2023-06-26 16:51:46,648 INFO [train.py:996] (3/4) Epoch 9, batch 24750, loss[loss=0.1957, simple_loss=0.2666, pruned_loss=0.06243, over 16300.00 frames. ], tot_loss[loss=0.2094, simple_loss=0.2831, pruned_loss=0.06781, over 4257658.55 frames. 
], batch size: 67, lr: 3.23e-03, grad_scale: 16.0 2023-06-26 16:52:10,563 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=1612302.0, ans=0.125 2023-06-26 16:52:40,347 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=1612362.0, ans=0.0 2023-06-26 16:52:43,851 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1612362.0, ans=0.0 2023-06-26 16:52:54,454 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=1612422.0, ans=0.2 2023-06-26 16:53:29,858 INFO [train.py:996] (3/4) Epoch 9, batch 24800, loss[loss=0.2029, simple_loss=0.2638, pruned_loss=0.07106, over 21700.00 frames. ], tot_loss[loss=0.2066, simple_loss=0.2793, pruned_loss=0.06701, over 4256929.03 frames. ], batch size: 282, lr: 3.23e-03, grad_scale: 32.0 2023-06-26 16:53:32,408 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1612542.0, ans=0.125 2023-06-26 16:53:40,887 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1612542.0, ans=0.125 2023-06-26 16:53:49,488 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=1612542.0, ans=0.04949747468305833 2023-06-26 16:54:10,073 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.499e+02 5.335e+02 8.218e+02 1.489e+03 3.682e+03, threshold=1.644e+03, percent-clipped=29.0 2023-06-26 16:55:05,512 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=6.35 vs. limit=10.0 2023-06-26 16:55:18,768 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=1612842.0, ans=0.0 2023-06-26 16:55:20,266 INFO [train.py:996] (3/4) Epoch 9, batch 24850, loss[loss=0.1783, simple_loss=0.2452, pruned_loss=0.0557, over 21429.00 frames. ], tot_loss[loss=0.2093, simple_loss=0.2807, pruned_loss=0.069, over 4256585.87 frames. ], batch size: 194, lr: 3.23e-03, grad_scale: 16.0 2023-06-26 16:55:36,397 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-26 16:56:05,109 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=1612902.0, ans=0.0 2023-06-26 16:56:05,110 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1612902.0, ans=0.1 2023-06-26 16:56:12,476 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.12 vs. limit=10.0 2023-06-26 16:56:35,694 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.31 vs. 
limit=22.5 2023-06-26 16:56:42,108 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1613022.0, ans=0.125 2023-06-26 16:56:43,761 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1613022.0, ans=0.0 2023-06-26 16:56:55,159 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.10 vs. limit=15.0 2023-06-26 16:57:14,582 INFO [train.py:996] (3/4) Epoch 9, batch 24900, loss[loss=0.2523, simple_loss=0.326, pruned_loss=0.08932, over 21599.00 frames. ], tot_loss[loss=0.21, simple_loss=0.2827, pruned_loss=0.06867, over 4264838.28 frames. ], batch size: 389, lr: 3.23e-03, grad_scale: 16.0 2023-06-26 16:57:52,241 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=1613202.0, ans=0.2 2023-06-26 16:57:54,394 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=7.17 vs. limit=15.0 2023-06-26 16:57:54,887 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.038e+02 5.408e+02 8.463e+02 1.347e+03 2.375e+03, threshold=1.693e+03, percent-clipped=14.0 2023-06-26 16:58:57,046 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=1613382.0, ans=0.07 2023-06-26 16:58:59,921 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.86 vs. limit=15.0 2023-06-26 16:59:11,114 INFO [train.py:996] (3/4) Epoch 9, batch 24950, loss[loss=0.2526, simple_loss=0.3275, pruned_loss=0.08884, over 21391.00 frames. ], tot_loss[loss=0.2187, simple_loss=0.2911, pruned_loss=0.07314, over 4267459.45 frames. ], batch size: 159, lr: 3.23e-03, grad_scale: 16.0 2023-06-26 16:59:27,532 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=1613442.0, ans=0.0 2023-06-26 17:00:39,757 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1613682.0, ans=0.0 2023-06-26 17:01:05,685 INFO [train.py:996] (3/4) Epoch 9, batch 25000, loss[loss=0.1861, simple_loss=0.2316, pruned_loss=0.0703, over 20323.00 frames. ], tot_loss[loss=0.2224, simple_loss=0.2968, pruned_loss=0.07403, over 4267589.37 frames. ], batch size: 703, lr: 3.23e-03, grad_scale: 16.0 2023-06-26 17:01:40,155 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.080e+02 5.287e+02 8.385e+02 1.349e+03 3.356e+03, threshold=1.677e+03, percent-clipped=10.0 2023-06-26 17:02:15,958 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=1613922.0, ans=0.5 2023-06-26 17:02:52,735 INFO [train.py:996] (3/4) Epoch 9, batch 25050, loss[loss=0.1817, simple_loss=0.2426, pruned_loss=0.06041, over 21218.00 frames. ], tot_loss[loss=0.2175, simple_loss=0.2899, pruned_loss=0.07254, over 4267706.91 frames. ], batch size: 549, lr: 3.22e-03, grad_scale: 16.0 2023-06-26 17:02:57,222 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.32 vs. 
limit=22.5 2023-06-26 17:03:00,046 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=1614042.0, ans=0.0 2023-06-26 17:03:30,341 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=1614102.0, ans=0.2 2023-06-26 17:03:57,854 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=1614222.0, ans=0.0 2023-06-26 17:04:00,170 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1614222.0, ans=0.125 2023-06-26 17:04:40,832 INFO [train.py:996] (3/4) Epoch 9, batch 25100, loss[loss=0.1734, simple_loss=0.2422, pruned_loss=0.05235, over 21636.00 frames. ], tot_loss[loss=0.2134, simple_loss=0.2848, pruned_loss=0.07102, over 4272005.94 frames. ], batch size: 247, lr: 3.22e-03, grad_scale: 16.0 2023-06-26 17:04:54,863 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1614342.0, ans=0.0 2023-06-26 17:05:15,427 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.352e+02 5.789e+02 8.437e+02 1.364e+03 2.592e+03, threshold=1.687e+03, percent-clipped=13.0 2023-06-26 17:05:24,774 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1614462.0, ans=0.125 2023-06-26 17:06:16,719 INFO [train.py:996] (3/4) Epoch 9, batch 25150, loss[loss=0.1826, simple_loss=0.2672, pruned_loss=0.04904, over 15986.00 frames. ], tot_loss[loss=0.2131, simple_loss=0.2883, pruned_loss=0.06892, over 4264875.30 frames. ], batch size: 61, lr: 3.22e-03, grad_scale: 16.0 2023-06-26 17:07:07,003 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.45 vs. limit=15.0 2023-06-26 17:07:10,105 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1614762.0, ans=0.125 2023-06-26 17:08:05,202 INFO [train.py:996] (3/4) Epoch 9, batch 25200, loss[loss=0.1997, simple_loss=0.2894, pruned_loss=0.05498, over 20880.00 frames. ], tot_loss[loss=0.2115, simple_loss=0.2884, pruned_loss=0.06724, over 4260689.03 frames. 
], batch size: 608, lr: 3.22e-03, grad_scale: 32.0 2023-06-26 17:08:14,474 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-26 17:08:38,669 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1615002.0, ans=0.125 2023-06-26 17:08:40,356 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1615002.0, ans=0.125 2023-06-26 17:08:50,508 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.256e+02 4.732e+02 7.162e+02 1.048e+03 3.410e+03, threshold=1.432e+03, percent-clipped=11.0 2023-06-26 17:08:53,332 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1615002.0, ans=0.1 2023-06-26 17:09:15,057 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1615122.0, ans=0.1 2023-06-26 17:09:45,859 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1615182.0, ans=0.125 2023-06-26 17:09:48,185 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=9.47 vs. limit=15.0 2023-06-26 17:09:52,308 INFO [train.py:996] (3/4) Epoch 9, batch 25250, loss[loss=0.213, simple_loss=0.2853, pruned_loss=0.07037, over 21444.00 frames. ], tot_loss[loss=0.2083, simple_loss=0.286, pruned_loss=0.06531, over 4259899.40 frames. ], batch size: 389, lr: 3.22e-03, grad_scale: 32.0 2023-06-26 17:11:07,858 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.74 vs. limit=12.0 2023-06-26 17:11:39,512 INFO [train.py:996] (3/4) Epoch 9, batch 25300, loss[loss=0.1798, simple_loss=0.2416, pruned_loss=0.05898, over 17266.00 frames. ], tot_loss[loss=0.2061, simple_loss=0.2826, pruned_loss=0.0648, over 4253371.53 frames. ], batch size: 62, lr: 3.22e-03, grad_scale: 16.0 2023-06-26 17:11:51,182 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1615542.0, ans=0.125 2023-06-26 17:12:22,272 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.284e+02 5.811e+02 7.982e+02 1.248e+03 2.930e+03, threshold=1.596e+03, percent-clipped=17.0 2023-06-26 17:12:25,067 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.33 vs. limit=22.5 2023-06-26 17:12:28,034 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1615662.0, ans=0.125 2023-06-26 17:12:36,230 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1615662.0, ans=0.125 2023-06-26 17:13:13,061 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1615782.0, ans=0.0 2023-06-26 17:13:29,735 INFO [train.py:996] (3/4) Epoch 9, batch 25350, loss[loss=0.1777, simple_loss=0.2663, pruned_loss=0.04458, over 21704.00 frames. ], tot_loss[loss=0.208, simple_loss=0.2854, pruned_loss=0.06528, over 4245648.25 frames. 
], batch size: 332, lr: 3.22e-03, grad_scale: 16.0 2023-06-26 17:14:03,807 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=1615902.0, ans=0.0 2023-06-26 17:14:28,716 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=1615962.0, ans=0.2 2023-06-26 17:14:42,454 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1616022.0, ans=0.125 2023-06-26 17:15:12,433 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=1616082.0, ans=0.05 2023-06-26 17:15:17,092 INFO [train.py:996] (3/4) Epoch 9, batch 25400, loss[loss=0.2166, simple_loss=0.2867, pruned_loss=0.07329, over 21592.00 frames. ], tot_loss[loss=0.2047, simple_loss=0.2815, pruned_loss=0.06395, over 4253512.59 frames. ], batch size: 263, lr: 3.22e-03, grad_scale: 16.0 2023-06-26 17:15:49,543 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=9.75 vs. limit=15.0 2023-06-26 17:15:54,026 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1616202.0, ans=0.125 2023-06-26 17:15:55,726 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1616202.0, ans=0.125 2023-06-26 17:15:58,560 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.261e+02 5.059e+02 8.454e+02 1.158e+03 2.444e+03, threshold=1.691e+03, percent-clipped=8.0 2023-06-26 17:16:47,367 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.02 vs. limit=6.0 2023-06-26 17:16:59,037 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=1616382.0, ans=0.09899494936611666 2023-06-26 17:17:05,786 INFO [train.py:996] (3/4) Epoch 9, batch 25450, loss[loss=0.2107, simple_loss=0.3118, pruned_loss=0.05478, over 21675.00 frames. ], tot_loss[loss=0.2067, simple_loss=0.2822, pruned_loss=0.06565, over 4259109.81 frames. ], batch size: 263, lr: 3.22e-03, grad_scale: 16.0 2023-06-26 17:17:29,742 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1616442.0, ans=0.0 2023-06-26 17:17:36,792 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1616502.0, ans=0.125 2023-06-26 17:17:54,885 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1616502.0, ans=0.125 2023-06-26 17:18:24,699 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1616622.0, ans=0.1 2023-06-26 17:18:50,254 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=1616682.0, ans=0.0 2023-06-26 17:18:55,060 INFO [train.py:996] (3/4) Epoch 9, batch 25500, loss[loss=0.1955, simple_loss=0.29, pruned_loss=0.05052, over 21749.00 frames. ], tot_loss[loss=0.2045, simple_loss=0.2826, pruned_loss=0.06322, over 4268506.77 frames. 
], batch size: 332, lr: 3.22e-03, grad_scale: 16.0 2023-06-26 17:19:12,016 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=1616742.0, ans=0.07 2023-06-26 17:19:17,694 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1616742.0, ans=0.125 2023-06-26 17:19:43,316 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.118e+02 5.222e+02 7.710e+02 1.108e+03 2.263e+03, threshold=1.542e+03, percent-clipped=6.0 2023-06-26 17:20:40,296 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=1616982.0, ans=0.95 2023-06-26 17:20:56,303 INFO [train.py:996] (3/4) Epoch 9, batch 25550, loss[loss=0.2719, simple_loss=0.3625, pruned_loss=0.09062, over 21525.00 frames. ], tot_loss[loss=0.21, simple_loss=0.2909, pruned_loss=0.06448, over 4270347.54 frames. ], batch size: 507, lr: 3.22e-03, grad_scale: 16.0 2023-06-26 17:20:58,730 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1617042.0, ans=0.125 2023-06-26 17:21:08,151 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1617042.0, ans=0.1 2023-06-26 17:21:28,369 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=1617102.0, ans=0.0 2023-06-26 17:21:58,116 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1617162.0, ans=0.1 2023-06-26 17:22:11,056 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=8.03 vs. limit=15.0 2023-06-26 17:22:45,967 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.whiten.whitening_limit, batch_count=1617342.0, ans=12.0 2023-06-26 17:22:46,550 INFO [train.py:996] (3/4) Epoch 9, batch 25600, loss[loss=0.2992, simple_loss=0.3551, pruned_loss=0.1217, over 21378.00 frames. ], tot_loss[loss=0.2118, simple_loss=0.2942, pruned_loss=0.06466, over 4270841.13 frames. ], batch size: 507, lr: 3.22e-03, grad_scale: 32.0 2023-06-26 17:23:29,861 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.761e+02 5.184e+02 7.757e+02 1.041e+03 2.426e+03, threshold=1.551e+03, percent-clipped=8.0 2023-06-26 17:23:40,705 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1617462.0, ans=0.125 2023-06-26 17:23:56,321 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-26 17:24:15,856 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=1617522.0, ans=0.2 2023-06-26 17:24:25,014 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1617582.0, ans=0.1 2023-06-26 17:24:36,582 INFO [train.py:996] (3/4) Epoch 9, batch 25650, loss[loss=0.2081, simple_loss=0.2931, pruned_loss=0.0616, over 19837.00 frames. ], tot_loss[loss=0.2142, simple_loss=0.294, pruned_loss=0.0672, over 4268830.93 frames. 
], batch size: 702, lr: 3.22e-03, grad_scale: 32.0 2023-06-26 17:25:24,217 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1617762.0, ans=0.125 2023-06-26 17:26:12,536 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1617882.0, ans=0.125 2023-06-26 17:26:24,643 INFO [train.py:996] (3/4) Epoch 9, batch 25700, loss[loss=0.2212, simple_loss=0.3004, pruned_loss=0.07096, over 21201.00 frames. ], tot_loss[loss=0.2134, simple_loss=0.2908, pruned_loss=0.06796, over 4270657.09 frames. ], batch size: 143, lr: 3.22e-03, grad_scale: 16.0 2023-06-26 17:26:58,921 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1618002.0, ans=0.125 2023-06-26 17:27:02,594 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=1618002.0, ans=0.2 2023-06-26 17:27:08,740 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.793e+02 5.331e+02 7.573e+02 1.078e+03 3.200e+03, threshold=1.515e+03, percent-clipped=12.0 2023-06-26 17:27:09,423 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=1618062.0, ans=0.125 2023-06-26 17:27:20,059 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1618062.0, ans=0.0 2023-06-26 17:27:49,138 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=1618122.0, ans=0.2 2023-06-26 17:28:21,566 INFO [train.py:996] (3/4) Epoch 9, batch 25750, loss[loss=0.2568, simple_loss=0.331, pruned_loss=0.09129, over 21259.00 frames. ], tot_loss[loss=0.2175, simple_loss=0.2947, pruned_loss=0.07016, over 4267281.92 frames. ], batch size: 159, lr: 3.22e-03, grad_scale: 16.0 2023-06-26 17:28:58,182 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1618302.0, ans=0.125 2023-06-26 17:30:18,729 INFO [train.py:996] (3/4) Epoch 9, batch 25800, loss[loss=0.3114, simple_loss=0.3688, pruned_loss=0.127, over 21344.00 frames. ], tot_loss[loss=0.2288, simple_loss=0.3077, pruned_loss=0.07494, over 4262312.13 frames. 
], batch size: 507, lr: 3.22e-03, grad_scale: 16.0 2023-06-26 17:30:24,035 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=1618542.0, ans=0.0 2023-06-26 17:30:24,077 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1618542.0, ans=0.125 2023-06-26 17:31:03,951 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.718e+02 5.908e+02 7.803e+02 1.133e+03 2.789e+03, threshold=1.561e+03, percent-clipped=14.0 2023-06-26 17:31:04,564 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1618662.0, ans=0.1 2023-06-26 17:31:07,811 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=1618662.0, ans=0.125 2023-06-26 17:31:14,669 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=1618662.0, ans=0.125 2023-06-26 17:32:08,633 INFO [train.py:996] (3/4) Epoch 9, batch 25850, loss[loss=0.2294, simple_loss=0.3058, pruned_loss=0.07645, over 21819.00 frames. ], tot_loss[loss=0.2279, simple_loss=0.3085, pruned_loss=0.07368, over 4268057.95 frames. ], batch size: 414, lr: 3.22e-03, grad_scale: 16.0 2023-06-26 17:32:26,812 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1618842.0, ans=0.125 2023-06-26 17:32:29,116 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1618842.0, ans=0.125 2023-06-26 17:33:24,734 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1619022.0, ans=0.125 2023-06-26 17:33:28,600 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=1619022.0, ans=0.0 2023-06-26 17:33:50,103 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer_na.min_abs, batch_count=1619082.0, ans=0.02 2023-06-26 17:34:03,394 INFO [train.py:996] (3/4) Epoch 9, batch 25900, loss[loss=0.2136, simple_loss=0.3056, pruned_loss=0.06085, over 19956.00 frames. ], tot_loss[loss=0.2292, simple_loss=0.3094, pruned_loss=0.07455, over 4267582.95 frames. ], batch size: 703, lr: 3.22e-03, grad_scale: 16.0 2023-06-26 17:34:14,355 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1619142.0, ans=0.0 2023-06-26 17:34:47,600 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.581e+02 5.400e+02 8.685e+02 1.109e+03 2.488e+03, threshold=1.737e+03, percent-clipped=11.0 2023-06-26 17:35:58,964 INFO [train.py:996] (3/4) Epoch 9, batch 25950, loss[loss=0.2459, simple_loss=0.3232, pruned_loss=0.08431, over 21550.00 frames. ], tot_loss[loss=0.2352, simple_loss=0.3162, pruned_loss=0.07709, over 4274702.50 frames. ], batch size: 414, lr: 3.22e-03, grad_scale: 16.0 2023-06-26 17:36:00,135 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.59 vs. 
limit=15.0 2023-06-26 17:36:41,848 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1619502.0, ans=0.1 2023-06-26 17:36:46,041 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1619562.0, ans=0.1 2023-06-26 17:36:47,592 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=1619562.0, ans=0.125 2023-06-26 17:36:54,258 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.min_positive, batch_count=1619562.0, ans=0.025 2023-06-26 17:37:01,386 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=1619622.0, ans=0.2 2023-06-26 17:37:49,208 INFO [train.py:996] (3/4) Epoch 9, batch 26000, loss[loss=0.273, simple_loss=0.3504, pruned_loss=0.09779, over 21820.00 frames. ], tot_loss[loss=0.2342, simple_loss=0.3159, pruned_loss=0.0763, over 4272285.08 frames. ], batch size: 124, lr: 3.22e-03, grad_scale: 32.0 2023-06-26 17:37:57,023 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-26 17:38:12,116 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=6.32 vs. limit=15.0 2023-06-26 17:38:25,777 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten.whitening_limit, batch_count=1619802.0, ans=15.0 2023-06-26 17:38:28,882 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-26 17:38:33,550 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.565e+02 5.045e+02 5.850e+02 7.861e+02 1.944e+03, threshold=1.170e+03, percent-clipped=2.0 2023-06-26 17:38:37,380 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=1619862.0, ans=0.2 2023-06-26 17:39:37,967 INFO [train.py:996] (3/4) Epoch 9, batch 26050, loss[loss=0.1934, simple_loss=0.2592, pruned_loss=0.06382, over 21071.00 frames. ], tot_loss[loss=0.2344, simple_loss=0.3148, pruned_loss=0.07698, over 4274673.07 frames. ], batch size: 607, lr: 3.22e-03, grad_scale: 16.0 2023-06-26 17:40:18,766 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=1620102.0, ans=0.125 2023-06-26 17:40:34,672 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.59 vs. limit=6.0 2023-06-26 17:40:45,149 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1620222.0, ans=0.0 2023-06-26 17:41:07,745 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=1620282.0, ans=0.2 2023-06-26 17:41:08,293 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.72 vs. limit=15.0 2023-06-26 17:41:21,072 INFO [train.py:996] (3/4) Epoch 9, batch 26100, loss[loss=0.2167, simple_loss=0.2862, pruned_loss=0.07366, over 21842.00 frames. ], tot_loss[loss=0.2312, simple_loss=0.3092, pruned_loss=0.07655, over 4280643.11 frames. 
], batch size: 124, lr: 3.22e-03, grad_scale: 16.0 2023-06-26 17:41:37,873 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1620342.0, ans=0.125 2023-06-26 17:41:58,512 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=1620402.0, ans=0.0 2023-06-26 17:42:00,606 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=8.11 vs. limit=15.0 2023-06-26 17:42:06,466 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.994e+02 5.585e+02 7.440e+02 1.140e+03 2.110e+03, threshold=1.488e+03, percent-clipped=23.0 2023-06-26 17:42:38,785 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1620522.0, ans=0.0 2023-06-26 17:42:52,864 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1620582.0, ans=0.125 2023-06-26 17:42:54,735 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1620582.0, ans=0.0 2023-06-26 17:43:04,899 INFO [train.py:996] (3/4) Epoch 9, batch 26150, loss[loss=0.2353, simple_loss=0.3219, pruned_loss=0.07438, over 21827.00 frames. ], tot_loss[loss=0.2298, simple_loss=0.3066, pruned_loss=0.0765, over 4290967.87 frames. ], batch size: 118, lr: 3.22e-03, grad_scale: 16.0 2023-06-26 17:43:20,611 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=1620642.0, ans=0.2 2023-06-26 17:43:45,133 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1620702.0, ans=0.1 2023-06-26 17:43:45,166 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1620702.0, ans=0.125 2023-06-26 17:45:00,353 INFO [train.py:996] (3/4) Epoch 9, batch 26200, loss[loss=0.1403, simple_loss=0.211, pruned_loss=0.03481, over 17353.00 frames. ], tot_loss[loss=0.2285, simple_loss=0.3075, pruned_loss=0.07473, over 4287770.95 frames. ], batch size: 60, lr: 3.22e-03, grad_scale: 16.0 2023-06-26 17:45:13,777 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=1620942.0, ans=0.05 2023-06-26 17:45:15,534 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1620942.0, ans=0.0 2023-06-26 17:45:17,400 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1620942.0, ans=0.125 2023-06-26 17:45:40,544 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=1621062.0, ans=0.0 2023-06-26 17:45:41,649 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.840e+02 5.161e+02 8.097e+02 1.241e+03 2.329e+03, threshold=1.619e+03, percent-clipped=17.0 2023-06-26 17:46:56,618 INFO [train.py:996] (3/4) Epoch 9, batch 26250, loss[loss=0.2239, simple_loss=0.2955, pruned_loss=0.07618, over 21480.00 frames. ], tot_loss[loss=0.2291, simple_loss=0.3099, pruned_loss=0.07416, over 4282824.34 frames. 
], batch size: 131, lr: 3.22e-03, grad_scale: 16.0 2023-06-26 17:47:16,674 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.15 vs. limit=15.0 2023-06-26 17:48:18,324 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=1621482.0, ans=0.125 2023-06-26 17:48:44,903 INFO [train.py:996] (3/4) Epoch 9, batch 26300, loss[loss=0.2243, simple_loss=0.2901, pruned_loss=0.07924, over 21309.00 frames. ], tot_loss[loss=0.2284, simple_loss=0.3076, pruned_loss=0.07458, over 4285793.16 frames. ], batch size: 159, lr: 3.22e-03, grad_scale: 16.0 2023-06-26 17:49:25,499 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.773e+02 5.088e+02 7.206e+02 1.171e+03 1.823e+03, threshold=1.441e+03, percent-clipped=7.0 2023-06-26 17:50:25,473 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.26 vs. limit=6.0 2023-06-26 17:50:34,483 INFO [train.py:996] (3/4) Epoch 9, batch 26350, loss[loss=0.2498, simple_loss=0.3194, pruned_loss=0.09015, over 21627.00 frames. ], tot_loss[loss=0.2281, simple_loss=0.306, pruned_loss=0.07513, over 4288821.25 frames. ], batch size: 230, lr: 3.22e-03, grad_scale: 16.0 2023-06-26 17:50:38,876 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1621842.0, ans=0.1 2023-06-26 17:50:42,636 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=1621842.0, ans=0.2 2023-06-26 17:50:55,201 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.07 vs. limit=6.0 2023-06-26 17:51:26,140 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1621962.0, ans=0.125 2023-06-26 17:51:29,867 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1621962.0, ans=0.0 2023-06-26 17:51:36,010 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=8.74 vs. limit=10.0 2023-06-26 17:52:23,816 INFO [train.py:996] (3/4) Epoch 9, batch 26400, loss[loss=0.21, simple_loss=0.2623, pruned_loss=0.07886, over 21571.00 frames. ], tot_loss[loss=0.2251, simple_loss=0.3004, pruned_loss=0.07492, over 4285180.08 frames. 
], batch size: 441, lr: 3.22e-03, grad_scale: 32.0 2023-06-26 17:53:12,758 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.822e+02 5.037e+02 6.959e+02 9.647e+02 1.675e+03, threshold=1.392e+03, percent-clipped=4.0 2023-06-26 17:53:31,553 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1622262.0, ans=0.125 2023-06-26 17:53:57,395 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1622382.0, ans=0.125 2023-06-26 17:54:08,284 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1622382.0, ans=0.125 2023-06-26 17:54:10,546 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.49 vs. limit=10.0 2023-06-26 17:54:16,742 INFO [train.py:996] (3/4) Epoch 9, batch 26450, loss[loss=0.2046, simple_loss=0.2806, pruned_loss=0.06429, over 21163.00 frames. ], tot_loss[loss=0.2232, simple_loss=0.2984, pruned_loss=0.074, over 4287006.71 frames. ], batch size: 143, lr: 3.22e-03, grad_scale: 16.0 2023-06-26 17:54:48,363 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.31 vs. limit=15.0 2023-06-26 17:54:53,835 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=13.17 vs. limit=22.5 2023-06-26 17:54:57,935 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=5.75 vs. limit=15.0 2023-06-26 17:56:13,576 INFO [train.py:996] (3/4) Epoch 9, batch 26500, loss[loss=0.2082, simple_loss=0.2872, pruned_loss=0.06455, over 21848.00 frames. ], tot_loss[loss=0.2234, simple_loss=0.3008, pruned_loss=0.07298, over 4277212.25 frames. ], batch size: 282, lr: 3.22e-03, grad_scale: 16.0 2023-06-26 17:57:00,592 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=1622802.0, ans=0.0 2023-06-26 17:57:07,193 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.802e+02 5.662e+02 1.052e+03 1.637e+03 4.186e+03, threshold=2.103e+03, percent-clipped=36.0 2023-06-26 17:58:11,200 INFO [train.py:996] (3/4) Epoch 9, batch 26550, loss[loss=0.2161, simple_loss=0.3133, pruned_loss=0.05947, over 21564.00 frames. ], tot_loss[loss=0.2191, simple_loss=0.2977, pruned_loss=0.07025, over 4277523.20 frames. ], batch size: 441, lr: 3.22e-03, grad_scale: 16.0 2023-06-26 17:58:20,083 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1623042.0, ans=0.125 2023-06-26 17:58:58,798 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=1623162.0, ans=0.125 2023-06-26 17:59:08,596 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=1623162.0, ans=0.125 2023-06-26 17:59:45,203 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-26 18:00:05,315 INFO [train.py:996] (3/4) Epoch 9, batch 26600, loss[loss=0.2014, simple_loss=0.2921, pruned_loss=0.05531, over 21639.00 frames. 
], tot_loss[loss=0.2169, simple_loss=0.298, pruned_loss=0.06784, over 4270954.37 frames. ], batch size: 263, lr: 3.22e-03, grad_scale: 16.0 2023-06-26 18:00:15,840 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1623342.0, ans=0.1 2023-06-26 18:00:33,177 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.82 vs. limit=15.0 2023-06-26 18:00:41,215 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=1623402.0, ans=0.2 2023-06-26 18:00:45,413 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=8.44 vs. limit=12.0 2023-06-26 18:00:47,578 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.393e+02 5.073e+02 7.169e+02 1.139e+03 3.123e+03, threshold=1.434e+03, percent-clipped=9.0 2023-06-26 18:01:19,890 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1623522.0, ans=0.125 2023-06-26 18:01:59,715 INFO [train.py:996] (3/4) Epoch 9, batch 26650, loss[loss=0.1592, simple_loss=0.2388, pruned_loss=0.03978, over 21264.00 frames. ], tot_loss[loss=0.2118, simple_loss=0.291, pruned_loss=0.06628, over 4271232.54 frames. ], batch size: 160, lr: 3.22e-03, grad_scale: 16.0 2023-06-26 18:02:11,533 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=1623642.0, ans=0.0 2023-06-26 18:02:21,834 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1623702.0, ans=0.125 2023-06-26 18:02:42,521 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=1623762.0, ans=0.0 2023-06-26 18:03:00,442 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.43 vs. limit=15.0 2023-06-26 18:03:05,738 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.36 vs. limit=12.0 2023-06-26 18:03:17,280 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.37 vs. limit=15.0 2023-06-26 18:03:40,922 INFO [train.py:996] (3/4) Epoch 9, batch 26700, loss[loss=0.1961, simple_loss=0.2668, pruned_loss=0.06274, over 21815.00 frames. ], tot_loss[loss=0.2057, simple_loss=0.284, pruned_loss=0.06373, over 4264769.22 frames. ], batch size: 298, lr: 3.22e-03, grad_scale: 16.0 2023-06-26 18:03:58,869 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.77 vs. limit=22.5 2023-06-26 18:04:29,907 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.831e+02 4.080e+02 5.616e+02 9.381e+02 2.662e+03, threshold=1.123e+03, percent-clipped=11.0 2023-06-26 18:05:31,861 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=1624182.0, ans=0.04949747468305833 2023-06-26 18:05:36,274 INFO [train.py:996] (3/4) Epoch 9, batch 26750, loss[loss=0.1986, simple_loss=0.2928, pruned_loss=0.05217, over 21702.00 frames. 
], tot_loss[loss=0.2062, simple_loss=0.2853, pruned_loss=0.06357, over 4274342.66 frames. ], batch size: 441, lr: 3.21e-03, grad_scale: 16.0 2023-06-26 18:05:45,994 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=1624242.0, ans=0.0 2023-06-26 18:07:27,059 INFO [train.py:996] (3/4) Epoch 9, batch 26800, loss[loss=0.2829, simple_loss=0.3486, pruned_loss=0.1086, over 21426.00 frames. ], tot_loss[loss=0.2142, simple_loss=0.2927, pruned_loss=0.06779, over 4281131.00 frames. ], batch size: 471, lr: 3.21e-03, grad_scale: 32.0 2023-06-26 18:07:27,742 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1624542.0, ans=0.125 2023-06-26 18:07:27,750 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=1624542.0, ans=0.0 2023-06-26 18:07:49,347 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=1624602.0, ans=0.125 2023-06-26 18:08:15,096 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.607e+02 5.810e+02 7.473e+02 1.088e+03 2.811e+03, threshold=1.495e+03, percent-clipped=19.0 2023-06-26 18:08:55,754 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=1624722.0, ans=0.04949747468305833 2023-06-26 18:09:22,002 INFO [train.py:996] (3/4) Epoch 9, batch 26850, loss[loss=0.2075, simple_loss=0.2616, pruned_loss=0.0767, over 20084.00 frames. ], tot_loss[loss=0.2178, simple_loss=0.2943, pruned_loss=0.07066, over 4275562.39 frames. ], batch size: 703, lr: 3.21e-03, grad_scale: 16.0 2023-06-26 18:09:22,868 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=1624842.0, ans=0.125 2023-06-26 18:09:22,895 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=1624842.0, ans=0.2 2023-06-26 18:09:29,986 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=7.90 vs. limit=15.0 2023-06-26 18:10:17,698 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=1624962.0, ans=0.0 2023-06-26 18:10:50,641 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1625082.0, ans=0.125 2023-06-26 18:11:09,548 INFO [train.py:996] (3/4) Epoch 9, batch 26900, loss[loss=0.1824, simple_loss=0.2487, pruned_loss=0.05804, over 21756.00 frames. ], tot_loss[loss=0.2125, simple_loss=0.2865, pruned_loss=0.06926, over 4267628.92 frames. 
], batch size: 300, lr: 3.21e-03, grad_scale: 16.0 2023-06-26 18:11:33,937 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=1625202.0, ans=0.04949747468305833 2023-06-26 18:11:52,557 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.498e+02 4.462e+02 5.999e+02 9.238e+02 1.607e+03, threshold=1.200e+03, percent-clipped=3.0 2023-06-26 18:12:21,368 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten.whitening_limit, batch_count=1625322.0, ans=15.0 2023-06-26 18:12:39,361 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=1625382.0, ans=0.0 2023-06-26 18:12:42,913 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1625382.0, ans=0.0 2023-06-26 18:12:57,976 INFO [train.py:996] (3/4) Epoch 9, batch 26950, loss[loss=0.231, simple_loss=0.3264, pruned_loss=0.06781, over 21769.00 frames. ], tot_loss[loss=0.2129, simple_loss=0.2867, pruned_loss=0.0695, over 4271088.88 frames. ], batch size: 282, lr: 3.21e-03, grad_scale: 16.0 2023-06-26 18:13:18,106 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1625442.0, ans=0.125 2023-06-26 18:13:53,202 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.21 vs. limit=15.0 2023-06-26 18:14:21,057 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=1625622.0, ans=0.0 2023-06-26 18:14:47,864 INFO [train.py:996] (3/4) Epoch 9, batch 27000, loss[loss=0.1846, simple_loss=0.2744, pruned_loss=0.04742, over 21686.00 frames. ], tot_loss[loss=0.2108, simple_loss=0.287, pruned_loss=0.06729, over 4264173.19 frames. ], batch size: 298, lr: 3.21e-03, grad_scale: 8.0 2023-06-26 18:14:47,865 INFO [train.py:1019] (3/4) Computing validation loss 2023-06-26 18:15:07,480 INFO [train.py:1028] (3/4) Epoch 9, validation: loss=0.2501, simple_loss=0.3419, pruned_loss=0.07919, over 1796401.00 frames. 2023-06-26 18:15:07,481 INFO [train.py:1029] (3/4) Maximum memory allocated so far is 23690MB 2023-06-26 18:15:37,103 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=1625802.0, ans=0.125 2023-06-26 18:15:39,017 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=1625802.0, ans=0.125 2023-06-26 18:15:59,909 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.455e+02 5.551e+02 8.937e+02 1.384e+03 3.879e+03, threshold=1.787e+03, percent-clipped=32.0 2023-06-26 18:16:03,852 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=1625862.0, ans=0.125 2023-06-26 18:16:13,081 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.61 vs. limit=15.0 2023-06-26 18:16:57,917 INFO [train.py:996] (3/4) Epoch 9, batch 27050, loss[loss=0.2051, simple_loss=0.2951, pruned_loss=0.05758, over 21861.00 frames. ], tot_loss[loss=0.2092, simple_loss=0.2895, pruned_loss=0.06441, over 4266708.85 frames. 
], batch size: 351, lr: 3.21e-03, grad_scale: 8.0 2023-06-26 18:17:05,648 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=1626042.0, ans=0.125 2023-06-26 18:17:09,607 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.94 vs. limit=22.5 2023-06-26 18:17:59,405 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1626162.0, ans=0.1 2023-06-26 18:18:01,355 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=1626222.0, ans=0.125 2023-06-26 18:18:17,626 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1626222.0, ans=0.125 2023-06-26 18:18:21,106 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1626222.0, ans=0.0 2023-06-26 18:18:24,737 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1626282.0, ans=0.125 2023-06-26 18:18:42,867 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=1626282.0, ans=0.2 2023-06-26 18:18:49,372 INFO [train.py:996] (3/4) Epoch 9, batch 27100, loss[loss=0.2117, simple_loss=0.3139, pruned_loss=0.05475, over 21608.00 frames. ], tot_loss[loss=0.2117, simple_loss=0.2919, pruned_loss=0.06578, over 4277900.68 frames. ], batch size: 230, lr: 3.21e-03, grad_scale: 8.0 2023-06-26 18:19:09,134 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-26 18:19:41,327 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1626462.0, ans=0.125 2023-06-26 18:19:42,259 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.471e+02 6.179e+02 8.599e+02 1.265e+03 2.717e+03, threshold=1.720e+03, percent-clipped=9.0 2023-06-26 18:20:46,677 INFO [train.py:996] (3/4) Epoch 9, batch 27150, loss[loss=0.2349, simple_loss=0.3308, pruned_loss=0.06947, over 21620.00 frames. ], tot_loss[loss=0.2227, simple_loss=0.3054, pruned_loss=0.06994, over 4276794.66 frames. ], batch size: 263, lr: 3.21e-03, grad_scale: 8.0 2023-06-26 18:21:24,489 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=1626702.0, ans=0.125 2023-06-26 18:21:45,957 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.33 vs. limit=15.0 2023-06-26 18:22:35,002 INFO [train.py:996] (3/4) Epoch 9, batch 27200, loss[loss=0.259, simple_loss=0.3394, pruned_loss=0.08933, over 21434.00 frames. ], tot_loss[loss=0.2278, simple_loss=0.3121, pruned_loss=0.07178, over 4279743.86 frames. ], batch size: 131, lr: 3.21e-03, grad_scale: 16.0 2023-06-26 18:23:25,800 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.290e+02 5.594e+02 8.054e+02 1.283e+03 2.318e+03, threshold=1.611e+03, percent-clipped=7.0 2023-06-26 18:23:30,440 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.91 vs. 
limit=15.0 2023-06-26 18:24:07,143 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=1627182.0, ans=0.0 2023-06-26 18:24:15,097 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.05 vs. limit=6.0 2023-06-26 18:24:30,175 INFO [train.py:996] (3/4) Epoch 9, batch 27250, loss[loss=0.273, simple_loss=0.3384, pruned_loss=0.1038, over 21408.00 frames. ], tot_loss[loss=0.2325, simple_loss=0.314, pruned_loss=0.0755, over 4280110.87 frames. ], batch size: 471, lr: 3.21e-03, grad_scale: 16.0 2023-06-26 18:25:36,534 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-26 18:25:44,039 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=1627422.0, ans=0.125 2023-06-26 18:25:55,997 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-26 18:26:20,979 INFO [train.py:996] (3/4) Epoch 9, batch 27300, loss[loss=0.2225, simple_loss=0.3106, pruned_loss=0.06717, over 21791.00 frames. ], tot_loss[loss=0.2342, simple_loss=0.3158, pruned_loss=0.07626, over 4281465.96 frames. ], batch size: 332, lr: 3.21e-03, grad_scale: 16.0 2023-06-26 18:27:18,626 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.468e+02 5.640e+02 6.768e+02 9.000e+02 1.859e+03, threshold=1.354e+03, percent-clipped=2.0 2023-06-26 18:28:17,678 INFO [train.py:996] (3/4) Epoch 9, batch 27350, loss[loss=0.2263, simple_loss=0.3099, pruned_loss=0.07136, over 21625.00 frames. ], tot_loss[loss=0.2354, simple_loss=0.3173, pruned_loss=0.07675, over 4276653.04 frames. ], batch size: 112, lr: 3.21e-03, grad_scale: 16.0 2023-06-26 18:29:02,931 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.32 vs. limit=15.0 2023-06-26 18:29:42,557 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1628082.0, ans=0.125 2023-06-26 18:29:52,603 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1628082.0, ans=0.0 2023-06-26 18:30:04,058 INFO [train.py:996] (3/4) Epoch 9, batch 27400, loss[loss=0.1923, simple_loss=0.257, pruned_loss=0.06377, over 21609.00 frames. ], tot_loss[loss=0.2316, simple_loss=0.3114, pruned_loss=0.07588, over 4283724.72 frames. 
], batch size: 263, lr: 3.21e-03, grad_scale: 16.0 2023-06-26 18:30:22,093 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1628142.0, ans=0.0 2023-06-26 18:30:54,135 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.725e+02 5.126e+02 6.894e+02 1.011e+03 2.169e+03, threshold=1.379e+03, percent-clipped=11.0 2023-06-26 18:30:54,723 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=1628262.0, ans=0.09899494936611666 2023-06-26 18:31:32,019 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=1628382.0, ans=0.05 2023-06-26 18:31:50,753 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1628442.0, ans=0.125 2023-06-26 18:31:52,288 INFO [train.py:996] (3/4) Epoch 9, batch 27450, loss[loss=0.226, simple_loss=0.3114, pruned_loss=0.07035, over 21885.00 frames. ], tot_loss[loss=0.2268, simple_loss=0.3054, pruned_loss=0.07409, over 4282381.54 frames. ], batch size: 372, lr: 3.21e-03, grad_scale: 16.0 2023-06-26 18:32:26,460 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.33 vs. limit=6.0 2023-06-26 18:32:38,549 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=1628562.0, ans=0.09899494936611666 2023-06-26 18:33:17,440 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=1628682.0, ans=0.2 2023-06-26 18:33:38,600 INFO [train.py:996] (3/4) Epoch 9, batch 27500, loss[loss=0.2082, simple_loss=0.2718, pruned_loss=0.07231, over 21843.00 frames. ], tot_loss[loss=0.2262, simple_loss=0.3039, pruned_loss=0.07429, over 4283912.00 frames. ], batch size: 247, lr: 3.21e-03, grad_scale: 16.0 2023-06-26 18:34:05,521 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1628802.0, ans=0.0 2023-06-26 18:34:29,857 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.928e+02 5.202e+02 7.866e+02 1.174e+03 2.313e+03, threshold=1.573e+03, percent-clipped=14.0 2023-06-26 18:34:44,198 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=1628862.0, ans=0.015 2023-06-26 18:34:55,394 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.05 vs. limit=15.0 2023-06-26 18:35:19,688 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.17 vs. limit=15.0 2023-06-26 18:35:27,110 INFO [train.py:996] (3/4) Epoch 9, batch 27550, loss[loss=0.2801, simple_loss=0.3786, pruned_loss=0.09083, over 19951.00 frames. ], tot_loss[loss=0.2211, simple_loss=0.2989, pruned_loss=0.07167, over 4279950.03 frames. 
], batch size: 702, lr: 3.21e-03, grad_scale: 16.0 2023-06-26 18:36:33,538 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=1629162.0, ans=0.2 2023-06-26 18:37:03,197 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=1629282.0, ans=0.07 2023-06-26 18:37:06,978 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1629282.0, ans=0.0 2023-06-26 18:37:19,965 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=1629342.0, ans=0.0 2023-06-26 18:37:21,053 INFO [train.py:996] (3/4) Epoch 9, batch 27600, loss[loss=0.2408, simple_loss=0.31, pruned_loss=0.08582, over 19983.00 frames. ], tot_loss[loss=0.2172, simple_loss=0.2921, pruned_loss=0.0712, over 4280652.66 frames. ], batch size: 702, lr: 3.21e-03, grad_scale: 32.0 2023-06-26 18:37:34,722 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.11 vs. limit=22.5 2023-06-26 18:38:11,876 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.490e+02 6.372e+02 8.382e+02 1.316e+03 3.069e+03, threshold=1.676e+03, percent-clipped=15.0 2023-06-26 18:38:15,421 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1629462.0, ans=0.125 2023-06-26 18:39:07,989 INFO [train.py:996] (3/4) Epoch 9, batch 27650, loss[loss=0.2175, simple_loss=0.2798, pruned_loss=0.07755, over 21616.00 frames. ], tot_loss[loss=0.2142, simple_loss=0.2872, pruned_loss=0.07064, over 4268805.07 frames. ], batch size: 389, lr: 3.21e-03, grad_scale: 16.0 2023-06-26 18:39:11,644 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=1629642.0, ans=0.125 2023-06-26 18:39:38,354 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=1629702.0, ans=0.0 2023-06-26 18:40:05,768 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1629762.0, ans=0.0 2023-06-26 18:40:24,463 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.49 vs. limit=15.0 2023-06-26 18:40:45,029 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.41 vs. limit=22.5 2023-06-26 18:40:45,225 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.13 vs. limit=22.5 2023-06-26 18:40:55,820 INFO [train.py:996] (3/4) Epoch 9, batch 27700, loss[loss=0.2535, simple_loss=0.3497, pruned_loss=0.07867, over 20841.00 frames. ], tot_loss[loss=0.213, simple_loss=0.288, pruned_loss=0.06902, over 4274330.41 frames. 
], batch size: 608, lr: 3.21e-03, grad_scale: 16.0 2023-06-26 18:41:05,245 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=1629942.0, ans=0.125 2023-06-26 18:41:17,615 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1630002.0, ans=0.1 2023-06-26 18:41:47,711 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.419e+02 4.763e+02 6.253e+02 8.900e+02 1.966e+03, threshold=1.251e+03, percent-clipped=3.0 2023-06-26 18:42:43,170 INFO [train.py:996] (3/4) Epoch 9, batch 27750, loss[loss=0.2277, simple_loss=0.3022, pruned_loss=0.07658, over 21749.00 frames. ], tot_loss[loss=0.2138, simple_loss=0.2904, pruned_loss=0.06862, over 4271561.92 frames. ], batch size: 441, lr: 3.21e-03, grad_scale: 16.0 2023-06-26 18:42:45,431 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=1630242.0, ans=0.125 2023-06-26 18:43:14,661 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1630302.0, ans=0.125 2023-06-26 18:43:17,136 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.08 vs. limit=15.0 2023-06-26 18:44:15,048 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=1630482.0, ans=0.0 2023-06-26 18:44:19,447 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1630482.0, ans=0.125 2023-06-26 18:44:23,844 INFO [train.py:996] (3/4) Epoch 9, batch 27800, loss[loss=0.2538, simple_loss=0.3282, pruned_loss=0.0897, over 21852.00 frames. ], tot_loss[loss=0.2136, simple_loss=0.2896, pruned_loss=0.06879, over 4274242.97 frames. ], batch size: 107, lr: 3.21e-03, grad_scale: 8.0 2023-06-26 18:45:23,032 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.595e+02 5.099e+02 6.470e+02 1.005e+03 1.791e+03, threshold=1.294e+03, percent-clipped=14.0 2023-06-26 18:45:28,620 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1630662.0, ans=0.0 2023-06-26 18:45:58,284 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1630782.0, ans=0.0 2023-06-26 18:46:18,782 INFO [train.py:996] (3/4) Epoch 9, batch 27850, loss[loss=0.2451, simple_loss=0.3246, pruned_loss=0.0828, over 21740.00 frames. ], tot_loss[loss=0.2138, simple_loss=0.2885, pruned_loss=0.06958, over 4282713.57 frames. ], batch size: 441, lr: 3.21e-03, grad_scale: 8.0 2023-06-26 18:46:47,289 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.43 vs. limit=6.0 2023-06-26 18:47:01,614 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.38 vs. limit=15.0 2023-06-26 18:47:36,955 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.90 vs. 
limit=15.0 2023-06-26 18:47:39,642 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=1631022.0, ans=0.0 2023-06-26 18:47:54,293 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=1631082.0, ans=0.05 2023-06-26 18:48:11,025 INFO [train.py:996] (3/4) Epoch 9, batch 27900, loss[loss=0.1891, simple_loss=0.2808, pruned_loss=0.04875, over 21676.00 frames. ], tot_loss[loss=0.2196, simple_loss=0.2978, pruned_loss=0.07068, over 4285855.96 frames. ], batch size: 247, lr: 3.21e-03, grad_scale: 8.0 2023-06-26 18:48:47,960 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1631202.0, ans=0.0 2023-06-26 18:49:04,286 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1631262.0, ans=0.125 2023-06-26 18:49:12,748 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.641e+02 5.533e+02 7.337e+02 1.067e+03 2.093e+03, threshold=1.467e+03, percent-clipped=13.0 2023-06-26 18:49:35,292 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=1631322.0, ans=0.125 2023-06-26 18:50:09,118 INFO [train.py:996] (3/4) Epoch 9, batch 27950, loss[loss=0.157, simple_loss=0.242, pruned_loss=0.03605, over 21493.00 frames. ], tot_loss[loss=0.2166, simple_loss=0.298, pruned_loss=0.06763, over 4282838.93 frames. ], batch size: 195, lr: 3.21e-03, grad_scale: 8.0 2023-06-26 18:51:00,476 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1631562.0, ans=0.0 2023-06-26 18:51:19,852 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1631622.0, ans=0.125 2023-06-26 18:51:37,240 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-26 18:51:47,435 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=1631682.0, ans=0.05 2023-06-26 18:51:51,010 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=1631682.0, ans=0.125 2023-06-26 18:51:58,504 INFO [train.py:996] (3/4) Epoch 9, batch 28000, loss[loss=0.235, simple_loss=0.2985, pruned_loss=0.08576, over 21218.00 frames. ], tot_loss[loss=0.213, simple_loss=0.2951, pruned_loss=0.06542, over 4288086.33 frames. 
], batch size: 143, lr: 3.21e-03, grad_scale: 16.0 2023-06-26 18:52:14,945 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1631742.0, ans=0.125 2023-06-26 18:52:27,448 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1631802.0, ans=0.125 2023-06-26 18:52:53,930 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.093e+02 5.535e+02 9.213e+02 1.280e+03 3.629e+03, threshold=1.843e+03, percent-clipped=20.0 2023-06-26 18:53:23,705 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=1631922.0, ans=0.025 2023-06-26 18:53:27,305 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1631982.0, ans=0.125 2023-06-26 18:53:56,215 INFO [train.py:996] (3/4) Epoch 9, batch 28050, loss[loss=0.2275, simple_loss=0.318, pruned_loss=0.06847, over 21713.00 frames. ], tot_loss[loss=0.213, simple_loss=0.293, pruned_loss=0.06647, over 4288639.74 frames. ], batch size: 414, lr: 3.21e-03, grad_scale: 16.0 2023-06-26 18:54:19,378 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1632102.0, ans=0.125 2023-06-26 18:55:44,938 INFO [train.py:996] (3/4) Epoch 9, batch 28100, loss[loss=0.2121, simple_loss=0.307, pruned_loss=0.05864, over 20833.00 frames. ], tot_loss[loss=0.2122, simple_loss=0.2907, pruned_loss=0.06683, over 4288710.41 frames. ], batch size: 608, lr: 3.21e-03, grad_scale: 16.0 2023-06-26 18:55:45,527 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=1632342.0, ans=0.5 2023-06-26 18:55:53,850 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1632342.0, ans=0.125 2023-06-26 18:56:17,134 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=1632402.0, ans=0.0 2023-06-26 18:56:36,208 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=1632462.0, ans=0.05 2023-06-26 18:56:37,148 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.724e+02 5.263e+02 6.694e+02 1.046e+03 2.729e+03, threshold=1.339e+03, percent-clipped=5.0 2023-06-26 18:56:37,855 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=1632462.0, ans=0.125 2023-06-26 18:57:28,356 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.min_abs, batch_count=1632642.0, ans=0.5 2023-06-26 18:57:29,620 INFO [train.py:996] (3/4) Epoch 9, batch 28150, loss[loss=0.1748, simple_loss=0.2425, pruned_loss=0.05359, over 21460.00 frames. ], tot_loss[loss=0.2088, simple_loss=0.285, pruned_loss=0.06627, over 4290673.16 frames. 
], batch size: 212, lr: 3.21e-03, grad_scale: 16.0 2023-06-26 18:57:38,783 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=1632642.0, ans=0.0 2023-06-26 18:57:57,539 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=1632702.0, ans=0.125 2023-06-26 18:58:00,662 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-26 18:58:23,208 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=1632762.0, ans=0.04949747468305833 2023-06-26 18:58:23,765 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys.whitening_limit, batch_count=1632762.0, ans=6.0 2023-06-26 18:59:04,569 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=1632882.0, ans=0.125 2023-06-26 18:59:17,474 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=1632942.0, ans=0.2 2023-06-26 18:59:18,532 INFO [train.py:996] (3/4) Epoch 9, batch 28200, loss[loss=0.2418, simple_loss=0.3192, pruned_loss=0.08219, over 21777.00 frames. ], tot_loss[loss=0.21, simple_loss=0.2841, pruned_loss=0.06798, over 4275180.91 frames. ], batch size: 124, lr: 3.21e-03, grad_scale: 16.0 2023-06-26 18:59:24,660 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=1632942.0, ans=0.0 2023-06-26 18:59:36,817 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1632942.0, ans=0.125 2023-06-26 18:59:49,032 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1633002.0, ans=0.0 2023-06-26 19:00:13,415 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.542e+02 6.148e+02 9.394e+02 1.401e+03 3.381e+03, threshold=1.879e+03, percent-clipped=30.0 2023-06-26 19:00:13,991 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=1633062.0, ans=0.125 2023-06-26 19:00:21,096 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=1633062.0, ans=0.0 2023-06-26 19:00:48,334 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=1633182.0, ans=0.125 2023-06-26 19:00:50,242 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=1633182.0, ans=0.5 2023-06-26 19:01:07,255 INFO [train.py:996] (3/4) Epoch 9, batch 28250, loss[loss=0.1943, simple_loss=0.2638, pruned_loss=0.06237, over 21642.00 frames. ], tot_loss[loss=0.2126, simple_loss=0.2868, pruned_loss=0.06922, over 4264240.56 frames. ], batch size: 298, lr: 3.21e-03, grad_scale: 16.0 2023-06-26 19:02:09,881 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-26 19:02:11,985 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.96 vs. 
limit=15.0 2023-06-26 19:02:13,426 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=1633362.0, ans=0.2 2023-06-26 19:02:15,279 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1633422.0, ans=0.125 2023-06-26 19:02:15,861 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.35 vs. limit=15.0 2023-06-26 19:03:02,750 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1633542.0, ans=0.125 2023-06-26 19:03:03,814 INFO [train.py:996] (3/4) Epoch 9, batch 28300, loss[loss=0.2212, simple_loss=0.3107, pruned_loss=0.06583, over 21509.00 frames. ], tot_loss[loss=0.2105, simple_loss=0.2847, pruned_loss=0.06821, over 4261379.40 frames. ], batch size: 508, lr: 3.21e-03, grad_scale: 16.0 2023-06-26 19:03:11,839 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1633542.0, ans=0.125 2023-06-26 19:03:24,447 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=1633542.0, ans=0.2 2023-06-26 19:03:25,945 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1633602.0, ans=0.1 2023-06-26 19:03:58,617 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.505e+02 4.596e+02 7.876e+02 1.186e+03 2.671e+03, threshold=1.575e+03, percent-clipped=4.0 2023-06-26 19:04:10,442 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1633662.0, ans=0.125 2023-06-26 19:04:31,256 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1633782.0, ans=0.125 2023-06-26 19:04:31,755 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.94 vs. limit=15.0 2023-06-26 19:04:53,323 INFO [train.py:996] (3/4) Epoch 9, batch 28350, loss[loss=0.1907, simple_loss=0.2535, pruned_loss=0.06391, over 21843.00 frames. ], tot_loss[loss=0.2028, simple_loss=0.28, pruned_loss=0.06283, over 4258678.44 frames. ], batch size: 98, lr: 3.21e-03, grad_scale: 16.0 2023-06-26 19:06:16,721 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1634082.0, ans=0.125 2023-06-26 19:06:46,260 INFO [train.py:996] (3/4) Epoch 9, batch 28400, loss[loss=0.1881, simple_loss=0.3181, pruned_loss=0.02901, over 20795.00 frames. ], tot_loss[loss=0.2016, simple_loss=0.2789, pruned_loss=0.06213, over 4259052.11 frames. ], batch size: 607, lr: 3.21e-03, grad_scale: 32.0 2023-06-26 19:07:41,787 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.757e+02 5.507e+02 7.639e+02 1.116e+03 2.582e+03, threshold=1.528e+03, percent-clipped=10.0 2023-06-26 19:08:33,698 INFO [train.py:996] (3/4) Epoch 9, batch 28450, loss[loss=0.2734, simple_loss=0.3441, pruned_loss=0.1014, over 21556.00 frames. ], tot_loss[loss=0.2077, simple_loss=0.2835, pruned_loss=0.06593, over 4272426.56 frames. 
], batch size: 414, lr: 3.20e-03, grad_scale: 16.0 2023-06-26 19:08:37,432 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1634442.0, ans=0.125 2023-06-26 19:09:51,699 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=1634622.0, ans=0.2 2023-06-26 19:09:56,080 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.03 vs. limit=6.0 2023-06-26 19:10:01,664 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=9.01 vs. limit=15.0 2023-06-26 19:10:03,221 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.69 vs. limit=15.0 2023-06-26 19:10:22,631 INFO [train.py:996] (3/4) Epoch 9, batch 28500, loss[loss=0.2407, simple_loss=0.3073, pruned_loss=0.08707, over 21337.00 frames. ], tot_loss[loss=0.2121, simple_loss=0.2865, pruned_loss=0.06888, over 4279261.31 frames. ], batch size: 548, lr: 3.20e-03, grad_scale: 16.0 2023-06-26 19:10:42,263 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1634742.0, ans=0.125 2023-06-26 19:10:43,994 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1634802.0, ans=0.125 2023-06-26 19:11:20,221 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.065e+02 5.079e+02 6.899e+02 9.776e+02 2.125e+03, threshold=1.380e+03, percent-clipped=6.0 2023-06-26 19:11:36,766 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1634922.0, ans=0.125 2023-06-26 19:11:42,039 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=1634922.0, ans=0.0 2023-06-26 19:11:56,618 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=3.76 vs. limit=15.0 2023-06-26 19:12:12,298 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=7.89 vs. limit=12.0 2023-06-26 19:12:18,032 INFO [train.py:996] (3/4) Epoch 9, batch 28550, loss[loss=0.206, simple_loss=0.2813, pruned_loss=0.06532, over 20753.00 frames. ], tot_loss[loss=0.218, simple_loss=0.2945, pruned_loss=0.07077, over 4284755.03 frames. ], batch size: 607, lr: 3.20e-03, grad_scale: 16.0 2023-06-26 19:12:51,179 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.10 vs. 
limit=22.5 2023-06-26 19:13:58,620 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=1635282.0, ans=0.0 2023-06-26 19:14:00,348 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1635282.0, ans=0.125 2023-06-26 19:14:03,581 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1635282.0, ans=0.125 2023-06-26 19:14:06,434 INFO [train.py:996] (3/4) Epoch 9, batch 28600, loss[loss=0.2339, simple_loss=0.3133, pruned_loss=0.07729, over 21506.00 frames. ], tot_loss[loss=0.2221, simple_loss=0.2999, pruned_loss=0.07218, over 4272068.52 frames. ], batch size: 194, lr: 3.20e-03, grad_scale: 8.0 2023-06-26 19:14:34,223 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1635402.0, ans=0.125 2023-06-26 19:14:39,562 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1635402.0, ans=0.125 2023-06-26 19:15:10,705 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.780e+02 5.299e+02 6.853e+02 1.013e+03 2.004e+03, threshold=1.371e+03, percent-clipped=8.0 2023-06-26 19:15:25,871 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.93 vs. limit=15.0 2023-06-26 19:16:02,345 INFO [train.py:996] (3/4) Epoch 9, batch 28650, loss[loss=0.1973, simple_loss=0.2661, pruned_loss=0.06425, over 21811.00 frames. ], tot_loss[loss=0.2188, simple_loss=0.2944, pruned_loss=0.07157, over 4278232.86 frames. ], batch size: 352, lr: 3.20e-03, grad_scale: 8.0 2023-06-26 19:16:18,752 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=1635702.0, ans=0.125 2023-06-26 19:17:02,239 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=1635762.0, ans=0.0 2023-06-26 19:17:50,886 INFO [train.py:996] (3/4) Epoch 9, batch 28700, loss[loss=0.2435, simple_loss=0.3091, pruned_loss=0.08895, over 21775.00 frames. ], tot_loss[loss=0.2201, simple_loss=0.2938, pruned_loss=0.07317, over 4276958.97 frames. ], batch size: 441, lr: 3.20e-03, grad_scale: 8.0 2023-06-26 19:18:04,886 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1635942.0, ans=0.1 2023-06-26 19:18:27,887 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1636002.0, ans=0.125 2023-06-26 19:18:48,050 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.582e+02 5.345e+02 7.889e+02 1.390e+03 2.918e+03, threshold=1.578e+03, percent-clipped=26.0 2023-06-26 19:19:40,201 INFO [train.py:996] (3/4) Epoch 9, batch 28750, loss[loss=0.2227, simple_loss=0.293, pruned_loss=0.07625, over 21407.00 frames. ], tot_loss[loss=0.2216, simple_loss=0.2948, pruned_loss=0.07417, over 4283195.72 frames. 
], batch size: 143, lr: 3.20e-03, grad_scale: 8.0 2023-06-26 19:20:11,776 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=1636302.0, ans=0.125 2023-06-26 19:20:28,186 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1636362.0, ans=0.1 2023-06-26 19:20:57,022 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.45 vs. limit=6.0 2023-06-26 19:21:21,938 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1636482.0, ans=0.125 2023-06-26 19:21:31,179 INFO [train.py:996] (3/4) Epoch 9, batch 28800, loss[loss=0.2395, simple_loss=0.3146, pruned_loss=0.08218, over 21641.00 frames. ], tot_loss[loss=0.2239, simple_loss=0.2985, pruned_loss=0.0747, over 4289092.46 frames. ], batch size: 263, lr: 3.20e-03, grad_scale: 16.0 2023-06-26 19:21:33,668 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=1636542.0, ans=0.04949747468305833 2023-06-26 19:21:58,306 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=1636602.0, ans=0.0 2023-06-26 19:22:21,794 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1636662.0, ans=0.1 2023-06-26 19:22:22,347 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.23 vs. limit=15.0 2023-06-26 19:22:33,374 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.768e+02 5.031e+02 6.250e+02 8.713e+02 2.260e+03, threshold=1.250e+03, percent-clipped=3.0 2023-06-26 19:23:02,271 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1636782.0, ans=0.125 2023-06-26 19:23:08,027 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1636782.0, ans=0.0 2023-06-26 19:23:12,003 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.16 vs. limit=12.0 2023-06-26 19:23:25,638 INFO [train.py:996] (3/4) Epoch 9, batch 28850, loss[loss=0.221, simple_loss=0.2986, pruned_loss=0.07175, over 21867.00 frames. ], tot_loss[loss=0.225, simple_loss=0.2987, pruned_loss=0.07569, over 4290363.24 frames. ], batch size: 118, lr: 3.20e-03, grad_scale: 16.0 2023-06-26 19:24:01,156 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1636902.0, ans=0.0 2023-06-26 19:24:49,999 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=13.06 vs. limit=22.5 2023-06-26 19:25:14,965 INFO [train.py:996] (3/4) Epoch 9, batch 28900, loss[loss=0.2769, simple_loss=0.3438, pruned_loss=0.105, over 21515.00 frames. ], tot_loss[loss=0.2272, simple_loss=0.301, pruned_loss=0.07664, over 4294988.84 frames. 
], batch size: 471, lr: 3.20e-03, grad_scale: 16.0 2023-06-26 19:25:22,650 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1637142.0, ans=0.125 2023-06-26 19:26:03,308 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=6.29 vs. limit=15.0 2023-06-26 19:26:12,013 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1637262.0, ans=0.125 2023-06-26 19:26:18,358 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.963e+02 5.862e+02 9.485e+02 1.263e+03 2.647e+03, threshold=1.897e+03, percent-clipped=25.0 2023-06-26 19:26:29,870 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=1637322.0, ans=0.05 2023-06-26 19:27:06,310 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=1637382.0, ans=0.2 2023-06-26 19:27:10,763 INFO [train.py:996] (3/4) Epoch 9, batch 28950, loss[loss=0.2347, simple_loss=0.3266, pruned_loss=0.07146, over 21705.00 frames. ], tot_loss[loss=0.227, simple_loss=0.3019, pruned_loss=0.0761, over 4283683.80 frames. ], batch size: 441, lr: 3.20e-03, grad_scale: 16.0 2023-06-26 19:27:35,229 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1637502.0, ans=0.125 2023-06-26 19:28:18,504 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1637622.0, ans=0.0 2023-06-26 19:28:26,069 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.96 vs. limit=22.5 2023-06-26 19:28:28,009 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=10.25 vs. limit=15.0 2023-06-26 19:28:57,827 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1637682.0, ans=0.0 2023-06-26 19:29:07,347 INFO [train.py:996] (3/4) Epoch 9, batch 29000, loss[loss=0.2429, simple_loss=0.318, pruned_loss=0.08391, over 21640.00 frames. ], tot_loss[loss=0.2287, simple_loss=0.3049, pruned_loss=0.07618, over 4275806.05 frames. 
], batch size: 263, lr: 3.20e-03, grad_scale: 16.0 2023-06-26 19:29:18,111 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=1637742.0, ans=0.0 2023-06-26 19:29:19,780 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1637742.0, ans=0.0 2023-06-26 19:29:35,915 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1637802.0, ans=0.1 2023-06-26 19:29:57,306 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=1637862.0, ans=0.0 2023-06-26 19:30:02,147 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.689e+02 5.853e+02 8.491e+02 1.284e+03 2.472e+03, threshold=1.698e+03, percent-clipped=8.0 2023-06-26 19:30:41,042 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1637982.0, ans=0.1 2023-06-26 19:30:57,304 INFO [train.py:996] (3/4) Epoch 9, batch 29050, loss[loss=0.2553, simple_loss=0.3107, pruned_loss=0.09995, over 21786.00 frames. ], tot_loss[loss=0.2292, simple_loss=0.305, pruned_loss=0.0767, over 4280665.46 frames. ], batch size: 508, lr: 3.20e-03, grad_scale: 16.0 2023-06-26 19:31:55,936 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.min_abs, batch_count=1638162.0, ans=0.5 2023-06-26 19:32:03,383 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1638222.0, ans=0.125 2023-06-26 19:32:45,665 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1638342.0, ans=0.2 2023-06-26 19:32:46,714 INFO [train.py:996] (3/4) Epoch 9, batch 29100, loss[loss=0.1943, simple_loss=0.2572, pruned_loss=0.06575, over 20133.00 frames. ], tot_loss[loss=0.2231, simple_loss=0.2973, pruned_loss=0.07441, over 4271647.00 frames. ], batch size: 702, lr: 3.20e-03, grad_scale: 16.0 2023-06-26 19:32:48,863 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=1638342.0, ans=0.125 2023-06-26 19:33:27,960 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1638462.0, ans=0.125 2023-06-26 19:33:29,578 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1638462.0, ans=0.125 2023-06-26 19:33:44,925 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.478e+02 5.318e+02 7.274e+02 9.701e+02 2.233e+03, threshold=1.455e+03, percent-clipped=4.0 2023-06-26 19:34:12,599 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=8.78 vs. limit=15.0 2023-06-26 19:34:25,246 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=1638582.0, ans=0.125 2023-06-26 19:34:35,012 INFO [train.py:996] (3/4) Epoch 9, batch 29150, loss[loss=0.2504, simple_loss=0.3499, pruned_loss=0.07548, over 21228.00 frames. ], tot_loss[loss=0.2206, simple_loss=0.2955, pruned_loss=0.07282, over 4272461.46 frames. 
], batch size: 548, lr: 3.20e-03, grad_scale: 16.0 2023-06-26 19:35:31,117 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=1638762.0, ans=0.2 2023-06-26 19:36:16,843 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1638882.0, ans=0.0 2023-06-26 19:36:16,869 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=1638882.0, ans=0.0 2023-06-26 19:36:23,264 INFO [train.py:996] (3/4) Epoch 9, batch 29200, loss[loss=0.1793, simple_loss=0.2384, pruned_loss=0.06011, over 21162.00 frames. ], tot_loss[loss=0.2171, simple_loss=0.2908, pruned_loss=0.07169, over 4274042.47 frames. ], batch size: 548, lr: 3.20e-03, grad_scale: 32.0 2023-06-26 19:36:36,682 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1638942.0, ans=0.0 2023-06-26 19:36:43,725 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=1638942.0, ans=0.0 2023-06-26 19:36:59,339 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=1639002.0, ans=0.125 2023-06-26 19:37:28,581 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.809e+02 5.329e+02 8.203e+02 1.175e+03 2.946e+03, threshold=1.641e+03, percent-clipped=12.0 2023-06-26 19:38:11,681 INFO [train.py:996] (3/4) Epoch 9, batch 29250, loss[loss=0.1865, simple_loss=0.2731, pruned_loss=0.04999, over 21232.00 frames. ], tot_loss[loss=0.2141, simple_loss=0.2896, pruned_loss=0.06928, over 4273134.97 frames. ], batch size: 159, lr: 3.20e-03, grad_scale: 16.0 2023-06-26 19:38:56,468 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1639362.0, ans=0.125 2023-06-26 19:39:43,855 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.78 vs. limit=15.0 2023-06-26 19:40:05,101 INFO [train.py:996] (3/4) Epoch 9, batch 29300, loss[loss=0.1858, simple_loss=0.2529, pruned_loss=0.05937, over 21818.00 frames. ], tot_loss[loss=0.2135, simple_loss=0.2902, pruned_loss=0.06842, over 4273617.93 frames. ], batch size: 98, lr: 3.20e-03, grad_scale: 16.0 2023-06-26 19:40:16,698 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=1639542.0, ans=0.0 2023-06-26 19:40:43,457 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=1639662.0, ans=0.125 2023-06-26 19:41:01,584 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.48 vs. limit=22.5 2023-06-26 19:41:03,790 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.751e+02 5.530e+02 7.690e+02 1.193e+03 2.293e+03, threshold=1.538e+03, percent-clipped=8.0 2023-06-26 19:41:10,754 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.08 vs. limit=22.5 2023-06-26 19:41:55,412 INFO [train.py:996] (3/4) Epoch 9, batch 29350, loss[loss=0.2183, simple_loss=0.2884, pruned_loss=0.07408, over 21772.00 frames. ], tot_loss[loss=0.2106, simple_loss=0.2861, pruned_loss=0.06757, over 4275639.07 frames. 
], batch size: 102, lr: 3.20e-03, grad_scale: 16.0 2023-06-26 19:42:24,341 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=1639902.0, ans=0.2 2023-06-26 19:42:25,934 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=1639902.0, ans=10.0 2023-06-26 19:42:32,740 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=1639902.0, ans=0.5 2023-06-26 19:42:32,806 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1639902.0, ans=0.125 2023-06-26 19:43:17,607 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1640022.0, ans=0.1 2023-06-26 19:43:47,613 INFO [train.py:996] (3/4) Epoch 9, batch 29400, loss[loss=0.1473, simple_loss=0.2155, pruned_loss=0.03956, over 21266.00 frames. ], tot_loss[loss=0.2072, simple_loss=0.2838, pruned_loss=0.06528, over 4274725.34 frames. ], batch size: 176, lr: 3.20e-03, grad_scale: 16.0 2023-06-26 19:44:06,287 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1640142.0, ans=0.125 2023-06-26 19:44:15,683 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1640202.0, ans=0.0 2023-06-26 19:44:19,250 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1640202.0, ans=0.125 2023-06-26 19:44:43,797 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.03 vs. limit=22.5 2023-06-26 19:44:53,333 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.711e+02 5.689e+02 1.066e+03 1.595e+03 4.259e+03, threshold=2.132e+03, percent-clipped=27.0 2023-06-26 19:45:22,013 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.81 vs. limit=15.0 2023-06-26 19:45:28,516 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1640382.0, ans=0.125 2023-06-26 19:45:30,361 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1640382.0, ans=0.0 2023-06-26 19:45:44,115 INFO [train.py:996] (3/4) Epoch 9, batch 29450, loss[loss=0.2673, simple_loss=0.33, pruned_loss=0.1023, over 21352.00 frames. ], tot_loss[loss=0.2089, simple_loss=0.2861, pruned_loss=0.06587, over 4272959.81 frames. ], batch size: 176, lr: 3.20e-03, grad_scale: 16.0 2023-06-26 19:47:13,528 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1640682.0, ans=0.0 2023-06-26 19:47:22,269 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1640682.0, ans=0.0 2023-06-26 19:47:26,961 INFO [train.py:996] (3/4) Epoch 9, batch 29500, loss[loss=0.2049, simple_loss=0.2779, pruned_loss=0.06597, over 21925.00 frames. ], tot_loss[loss=0.2138, simple_loss=0.2903, pruned_loss=0.06864, over 4277820.12 frames. 
], batch size: 333, lr: 3.20e-03, grad_scale: 16.0 2023-06-26 19:48:00,417 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=1640802.0, ans=0.125 2023-06-26 19:48:08,172 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.73 vs. limit=12.0 2023-06-26 19:48:30,394 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.856e+02 6.070e+02 8.083e+02 1.104e+03 1.958e+03, threshold=1.617e+03, percent-clipped=0.0 2023-06-26 19:48:41,642 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1640922.0, ans=0.125 2023-06-26 19:48:58,969 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=1640982.0, ans=0.125 2023-06-26 19:49:14,785 INFO [train.py:996] (3/4) Epoch 9, batch 29550, loss[loss=0.1878, simple_loss=0.2549, pruned_loss=0.06034, over 21168.00 frames. ], tot_loss[loss=0.2157, simple_loss=0.2902, pruned_loss=0.07059, over 4288029.08 frames. ], batch size: 608, lr: 3.20e-03, grad_scale: 16.0 2023-06-26 19:49:34,398 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=13.30 vs. limit=15.0 2023-06-26 19:49:35,586 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1641042.0, ans=0.125 2023-06-26 19:49:49,362 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1641102.0, ans=0.125 2023-06-26 19:49:55,062 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.23 vs. limit=10.0 2023-06-26 19:51:07,253 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=2.52 vs. limit=15.0 2023-06-26 19:51:11,582 INFO [train.py:996] (3/4) Epoch 9, batch 29600, loss[loss=0.2673, simple_loss=0.3585, pruned_loss=0.08801, over 21728.00 frames. ], tot_loss[loss=0.2217, simple_loss=0.2978, pruned_loss=0.07279, over 4290563.51 frames. ], batch size: 351, lr: 3.20e-03, grad_scale: 32.0 2023-06-26 19:51:38,060 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1641402.0, ans=0.0 2023-06-26 19:52:15,226 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1641462.0, ans=0.125 2023-06-26 19:52:16,239 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.429e+02 6.278e+02 9.739e+02 1.305e+03 2.412e+03, threshold=1.948e+03, percent-clipped=12.0 2023-06-26 19:53:00,018 INFO [train.py:996] (3/4) Epoch 9, batch 29650, loss[loss=0.1933, simple_loss=0.3159, pruned_loss=0.03531, over 19799.00 frames. ], tot_loss[loss=0.2157, simple_loss=0.2939, pruned_loss=0.06875, over 4284710.17 frames. ], batch size: 702, lr: 3.20e-03, grad_scale: 16.0 2023-06-26 19:54:49,514 INFO [train.py:996] (3/4) Epoch 9, batch 29700, loss[loss=0.2316, simple_loss=0.3323, pruned_loss=0.0654, over 21433.00 frames. ], tot_loss[loss=0.2165, simple_loss=0.2954, pruned_loss=0.06881, over 4287367.69 frames. 
], batch size: 211, lr: 3.20e-03, grad_scale: 16.0 2023-06-26 19:55:43,301 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1642062.0, ans=0.125 2023-06-26 19:55:55,268 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.514e+02 4.988e+02 7.625e+02 1.121e+03 2.201e+03, threshold=1.525e+03, percent-clipped=1.0 2023-06-26 19:56:33,811 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=1642182.0, ans=0.95 2023-06-26 19:56:38,149 INFO [train.py:996] (3/4) Epoch 9, batch 29750, loss[loss=0.2244, simple_loss=0.317, pruned_loss=0.06588, over 21442.00 frames. ], tot_loss[loss=0.2188, simple_loss=0.3004, pruned_loss=0.06859, over 4282112.79 frames. ], batch size: 194, lr: 3.20e-03, grad_scale: 16.0 2023-06-26 19:57:04,776 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1642302.0, ans=0.125 2023-06-26 19:57:11,279 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=1642302.0, ans=0.0 2023-06-26 19:57:28,447 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1642362.0, ans=0.0 2023-06-26 19:58:14,159 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys.whitening_limit, batch_count=1642482.0, ans=6.0 2023-06-26 19:58:22,183 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1642482.0, ans=0.0 2023-06-26 19:58:26,736 INFO [train.py:996] (3/4) Epoch 9, batch 29800, loss[loss=0.2175, simple_loss=0.2907, pruned_loss=0.07211, over 21758.00 frames. ], tot_loss[loss=0.2211, simple_loss=0.3024, pruned_loss=0.06985, over 4292284.70 frames. ], batch size: 389, lr: 3.20e-03, grad_scale: 16.0 2023-06-26 19:59:33,429 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.208e+02 7.577e+02 1.107e+03 1.626e+03 2.906e+03, threshold=2.213e+03, percent-clipped=30.0 2023-06-26 19:59:59,920 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1642782.0, ans=0.125 2023-06-26 20:00:15,158 INFO [train.py:996] (3/4) Epoch 9, batch 29850, loss[loss=0.2053, simple_loss=0.2815, pruned_loss=0.06458, over 21850.00 frames. ], tot_loss[loss=0.2171, simple_loss=0.2989, pruned_loss=0.06763, over 4285985.90 frames. ], batch size: 118, lr: 3.20e-03, grad_scale: 16.0 2023-06-26 20:00:24,217 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=1642842.0, ans=0.0 2023-06-26 20:00:38,786 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1642902.0, ans=0.125 2023-06-26 20:00:43,854 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1642902.0, ans=0.125 2023-06-26 20:01:16,937 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1642962.0, ans=0.0 2023-06-26 20:02:08,048 INFO [train.py:996] (3/4) Epoch 9, batch 29900, loss[loss=0.237, simple_loss=0.3049, pruned_loss=0.08454, over 21398.00 frames. ], tot_loss[loss=0.2169, simple_loss=0.2964, pruned_loss=0.06866, over 4287774.73 frames. 
], batch size: 176, lr: 3.20e-03, grad_scale: 16.0 2023-06-26 20:02:10,117 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1643142.0, ans=0.1 2023-06-26 20:02:48,843 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=1643262.0, ans=0.2 2023-06-26 20:03:01,610 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-26 20:03:09,778 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.950e+02 5.576e+02 8.031e+02 1.172e+03 2.675e+03, threshold=1.606e+03, percent-clipped=3.0 2023-06-26 20:03:32,385 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-26 20:03:57,894 INFO [train.py:996] (3/4) Epoch 9, batch 29950, loss[loss=0.2876, simple_loss=0.3473, pruned_loss=0.1139, over 21470.00 frames. ], tot_loss[loss=0.2221, simple_loss=0.3, pruned_loss=0.0721, over 4288034.34 frames. ], batch size: 471, lr: 3.20e-03, grad_scale: 16.0 2023-06-26 20:04:02,094 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1643442.0, ans=0.0 2023-06-26 20:05:18,653 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1643622.0, ans=0.125 2023-06-26 20:05:34,650 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.22 vs. limit=22.5 2023-06-26 20:05:54,875 INFO [train.py:996] (3/4) Epoch 9, batch 30000, loss[loss=0.2377, simple_loss=0.3299, pruned_loss=0.07271, over 21466.00 frames. ], tot_loss[loss=0.2238, simple_loss=0.3026, pruned_loss=0.0725, over 4287036.29 frames. ], batch size: 471, lr: 3.20e-03, grad_scale: 32.0 2023-06-26 20:05:54,876 INFO [train.py:1019] (3/4) Computing validation loss 2023-06-26 20:06:15,950 INFO [train.py:1028] (3/4) Epoch 9, validation: loss=0.2518, simple_loss=0.3443, pruned_loss=0.07961, over 1796401.00 frames. 2023-06-26 20:06:15,951 INFO [train.py:1029] (3/4) Maximum memory allocated so far is 23690MB 2023-06-26 20:06:48,099 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=1643802.0, ans=0.0 2023-06-26 20:07:11,513 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=1643862.0, ans=0.2 2023-06-26 20:07:22,150 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.541e+02 6.693e+02 9.863e+02 1.324e+03 2.517e+03, threshold=1.973e+03, percent-clipped=14.0 2023-06-26 20:08:09,886 INFO [train.py:996] (3/4) Epoch 9, batch 30050, loss[loss=0.1273, simple_loss=0.1717, pruned_loss=0.04143, over 16143.00 frames. ], tot_loss[loss=0.2212, simple_loss=0.3036, pruned_loss=0.06945, over 4273853.42 frames. 
], batch size: 60, lr: 3.20e-03, grad_scale: 16.0 2023-06-26 20:08:42,227 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1644102.0, ans=0.0 2023-06-26 20:08:56,384 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1644162.0, ans=0.125 2023-06-26 20:08:58,048 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1644162.0, ans=0.125 2023-06-26 20:09:17,748 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.60 vs. limit=15.0 2023-06-26 20:09:24,567 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1644222.0, ans=0.1 2023-06-26 20:09:29,246 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.min_positive, batch_count=1644222.0, ans=0.025 2023-06-26 20:10:03,821 INFO [train.py:996] (3/4) Epoch 9, batch 30100, loss[loss=0.1976, simple_loss=0.2588, pruned_loss=0.06815, over 21513.00 frames. ], tot_loss[loss=0.2207, simple_loss=0.3024, pruned_loss=0.06952, over 4273179.95 frames. ], batch size: 195, lr: 3.20e-03, grad_scale: 16.0 2023-06-26 20:11:07,511 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.020e+02 5.633e+02 9.341e+02 1.482e+03 2.871e+03, threshold=1.868e+03, percent-clipped=12.0 2023-06-26 20:11:53,758 INFO [train.py:996] (3/4) Epoch 9, batch 30150, loss[loss=0.2549, simple_loss=0.3223, pruned_loss=0.09369, over 21560.00 frames. ], tot_loss[loss=0.2207, simple_loss=0.299, pruned_loss=0.07117, over 4275883.27 frames. ], batch size: 415, lr: 3.19e-03, grad_scale: 16.0 2023-06-26 20:11:58,311 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.04 vs. limit=12.0 2023-06-26 20:12:32,536 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1644702.0, ans=0.125 2023-06-26 20:12:59,902 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.28 vs. limit=6.0 2023-06-26 20:13:23,322 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1644822.0, ans=0.0 2023-06-26 20:13:50,898 INFO [train.py:996] (3/4) Epoch 9, batch 30200, loss[loss=0.2323, simple_loss=0.3389, pruned_loss=0.06281, over 21636.00 frames. ], tot_loss[loss=0.2209, simple_loss=0.3015, pruned_loss=0.07019, over 4280874.58 frames. 
], batch size: 414, lr: 3.19e-03, grad_scale: 16.0 2023-06-26 20:13:51,670 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=1644942.0, ans=0.125 2023-06-26 20:14:35,236 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1645002.0, ans=0.1 2023-06-26 20:14:38,501 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1645062.0, ans=0.125 2023-06-26 20:14:51,470 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=1645062.0, ans=0.0 2023-06-26 20:15:01,601 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.601e+02 6.016e+02 8.945e+02 1.496e+03 2.296e+03, threshold=1.789e+03, percent-clipped=11.0 2023-06-26 20:15:13,177 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1645122.0, ans=0.1 2023-06-26 20:15:25,545 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=5.54 vs. limit=15.0 2023-06-26 20:15:42,561 INFO [train.py:996] (3/4) Epoch 9, batch 30250, loss[loss=0.3159, simple_loss=0.4042, pruned_loss=0.1138, over 21677.00 frames. ], tot_loss[loss=0.228, simple_loss=0.31, pruned_loss=0.07301, over 4275392.98 frames. ], batch size: 441, lr: 3.19e-03, grad_scale: 16.0 2023-06-26 20:16:11,554 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1645302.0, ans=0.125 2023-06-26 20:16:17,020 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1645302.0, ans=0.125 2023-06-26 20:16:24,814 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=8.44 vs. limit=15.0 2023-06-26 20:16:31,477 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.06 vs. limit=6.0 2023-06-26 20:17:10,369 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=7.35 vs. limit=15.0 2023-06-26 20:17:29,728 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=1645482.0, ans=0.0 2023-06-26 20:17:37,238 INFO [train.py:996] (3/4) Epoch 9, batch 30300, loss[loss=0.1902, simple_loss=0.2542, pruned_loss=0.06307, over 20709.00 frames. ], tot_loss[loss=0.2248, simple_loss=0.306, pruned_loss=0.0718, over 4272543.82 frames. ], batch size: 607, lr: 3.19e-03, grad_scale: 16.0 2023-06-26 20:18:11,022 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.38 vs. 
limit=22.5 2023-06-26 20:18:17,842 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1645662.0, ans=0.1 2023-06-26 20:18:46,591 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1645722.0, ans=0.125 2023-06-26 20:18:47,512 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.854e+02 6.241e+02 9.150e+02 1.357e+03 2.520e+03, threshold=1.830e+03, percent-clipped=12.0 2023-06-26 20:18:55,516 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1645722.0, ans=0.1 2023-06-26 20:18:59,432 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=1645722.0, ans=0.2 2023-06-26 20:19:01,406 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1645722.0, ans=0.1 2023-06-26 20:19:05,128 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=1645782.0, ans=0.2 2023-06-26 20:19:06,977 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1645782.0, ans=0.1 2023-06-26 20:19:35,119 INFO [train.py:996] (3/4) Epoch 9, batch 30350, loss[loss=0.2295, simple_loss=0.3125, pruned_loss=0.07321, over 21692.00 frames. ], tot_loss[loss=0.2272, simple_loss=0.3077, pruned_loss=0.07335, over 4272955.46 frames. ], batch size: 298, lr: 3.19e-03, grad_scale: 16.0 2023-06-26 20:20:32,609 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1646022.0, ans=0.125 2023-06-26 20:20:58,736 INFO [train.py:996] (3/4) Epoch 9, batch 30400, loss[loss=0.2089, simple_loss=0.2562, pruned_loss=0.08077, over 20343.00 frames. ], tot_loss[loss=0.2227, simple_loss=0.3017, pruned_loss=0.07186, over 4265601.25 frames. ], batch size: 703, lr: 3.19e-03, grad_scale: 32.0 2023-06-26 20:21:15,933 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=1646142.0, ans=0.0 2023-06-26 20:21:48,850 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-26 20:21:55,168 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.123e+02 6.385e+02 9.749e+02 1.472e+03 9.200e+03, threshold=1.950e+03, percent-clipped=15.0 2023-06-26 20:22:00,957 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1646322.0, ans=0.0 2023-06-26 20:22:29,097 INFO [train.py:996] (3/4) Epoch 9, batch 30450, loss[loss=0.267, simple_loss=0.3899, pruned_loss=0.07202, over 19931.00 frames. ], tot_loss[loss=0.2225, simple_loss=0.302, pruned_loss=0.07147, over 4205224.53 frames. ], batch size: 702, lr: 3.19e-03, grad_scale: 16.0 2023-06-26 20:22:54,540 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.30 vs. limit=22.5 2023-06-26 20:23:03,867 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.59 vs. 
limit=22.5 2023-06-26 20:25:55,219 INFO [train.py:996] (3/4) Epoch 10, batch 0, loss[loss=0.2067, simple_loss=0.2777, pruned_loss=0.06778, over 21866.00 frames. ], tot_loss[loss=0.2067, simple_loss=0.2777, pruned_loss=0.06778, over 21866.00 frames. ], batch size: 373, lr: 3.02e-03, grad_scale: 32.0 2023-06-26 20:25:55,219 INFO [train.py:1019] (3/4) Computing validation loss 2023-06-26 20:26:11,822 INFO [train.py:1028] (3/4) Epoch 10, validation: loss=0.2437, simple_loss=0.3472, pruned_loss=0.0701, over 1796401.00 frames. 2023-06-26 20:26:11,823 INFO [train.py:1029] (3/4) Maximum memory allocated so far is 23690MB 2023-06-26 20:26:29,242 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=1646712.0, ans=0.125 2023-06-26 20:26:44,245 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=1646772.0, ans=0.0 2023-06-26 20:27:01,583 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=1646832.0, ans=0.015 2023-06-26 20:27:26,113 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.43 vs. limit=22.5 2023-06-26 20:27:27,218 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=1646892.0, ans=0.125 2023-06-26 20:27:35,019 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.973e+02 1.183e+03 2.082e+03 3.728e+03 9.226e+03, threshold=4.165e+03, percent-clipped=55.0 2023-06-26 20:27:57,601 INFO [train.py:996] (3/4) Epoch 10, batch 50, loss[loss=0.2094, simple_loss=0.3051, pruned_loss=0.05688, over 21589.00 frames. ], tot_loss[loss=0.2265, simple_loss=0.3093, pruned_loss=0.07188, over 962448.04 frames. ], batch size: 230, lr: 3.02e-03, grad_scale: 16.0 2023-06-26 20:28:08,518 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=1647012.0, ans=0.2 2023-06-26 20:28:10,517 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1647012.0, ans=0.125 2023-06-26 20:29:25,964 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1647252.0, ans=0.125 2023-06-26 20:29:34,474 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=1647252.0, ans=0.0 2023-06-26 20:29:44,270 INFO [train.py:996] (3/4) Epoch 10, batch 100, loss[loss=0.2234, simple_loss=0.323, pruned_loss=0.06188, over 20717.00 frames. ], tot_loss[loss=0.236, simple_loss=0.3236, pruned_loss=0.0742, over 1699734.69 frames. ], batch size: 607, lr: 3.02e-03, grad_scale: 16.0 2023-06-26 20:30:08,390 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer_ff3.min_abs, batch_count=1647372.0, ans=0.2 2023-06-26 20:31:06,600 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.835e+02 5.191e+02 6.971e+02 9.608e+02 1.975e+03, threshold=1.394e+03, percent-clipped=0.0 2023-06-26 20:31:26,029 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=1647552.0, ans=0.04949747468305833 2023-06-26 20:31:28,449 INFO [train.py:996] (3/4) Epoch 10, batch 150, loss[loss=0.2319, simple_loss=0.3212, pruned_loss=0.07129, over 21631.00 frames. 
], tot_loss[loss=0.2371, simple_loss=0.3254, pruned_loss=0.07441, over 2271173.05 frames. ], batch size: 230, lr: 3.01e-03, grad_scale: 16.0 2023-06-26 20:32:50,180 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer_ff3.min_abs, batch_count=1647792.0, ans=0.2 2023-06-26 20:33:02,034 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=2.78 vs. limit=15.0 2023-06-26 20:33:14,179 INFO [train.py:996] (3/4) Epoch 10, batch 200, loss[loss=0.2159, simple_loss=0.2894, pruned_loss=0.07121, over 21921.00 frames. ], tot_loss[loss=0.2325, simple_loss=0.3207, pruned_loss=0.07219, over 2720755.18 frames. ], batch size: 316, lr: 3.01e-03, grad_scale: 16.0 2023-06-26 20:33:21,615 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=1647912.0, ans=0.2 2023-06-26 20:33:43,171 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1647972.0, ans=0.125 2023-06-26 20:34:39,712 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.906e+02 5.339e+02 8.333e+02 1.175e+03 2.265e+03, threshold=1.667e+03, percent-clipped=16.0 2023-06-26 20:35:00,849 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=1648212.0, ans=0.125 2023-06-26 20:35:01,880 INFO [train.py:996] (3/4) Epoch 10, batch 250, loss[loss=0.1908, simple_loss=0.26, pruned_loss=0.06079, over 22039.00 frames. ], tot_loss[loss=0.2285, simple_loss=0.314, pruned_loss=0.07153, over 3070132.65 frames. ], batch size: 103, lr: 3.01e-03, grad_scale: 16.0 2023-06-26 20:36:36,839 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1648452.0, ans=0.125 2023-06-26 20:36:54,053 INFO [train.py:996] (3/4) Epoch 10, batch 300, loss[loss=0.218, simple_loss=0.2832, pruned_loss=0.07637, over 20000.00 frames. ], tot_loss[loss=0.2262, simple_loss=0.3088, pruned_loss=0.07177, over 3333792.15 frames. ], batch size: 702, lr: 3.01e-03, grad_scale: 16.0 2023-06-26 20:38:17,614 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.698e+02 5.791e+02 8.130e+02 1.304e+03 2.175e+03, threshold=1.626e+03, percent-clipped=9.0 2023-06-26 20:38:23,610 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=1648752.0, ans=0.2 2023-06-26 20:38:27,181 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1648752.0, ans=0.125 2023-06-26 20:38:40,490 INFO [train.py:996] (3/4) Epoch 10, batch 350, loss[loss=0.23, simple_loss=0.3199, pruned_loss=0.07006, over 21729.00 frames. ], tot_loss[loss=0.2194, simple_loss=0.2991, pruned_loss=0.06985, over 3539623.86 frames. ], batch size: 118, lr: 3.01e-03, grad_scale: 16.0 2023-06-26 20:39:47,085 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-26 20:39:51,331 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=26.61 vs. 
limit=22.5 2023-06-26 20:40:22,025 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1649052.0, ans=0.1 2023-06-26 20:40:24,671 INFO [train.py:996] (3/4) Epoch 10, batch 400, loss[loss=0.1913, simple_loss=0.2944, pruned_loss=0.04409, over 21803.00 frames. ], tot_loss[loss=0.2137, simple_loss=0.2923, pruned_loss=0.06759, over 3685826.03 frames. ], batch size: 371, lr: 3.01e-03, grad_scale: 32.0 2023-06-26 20:40:41,767 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=1649112.0, ans=0.025 2023-06-26 20:41:16,760 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1649232.0, ans=0.125 2023-06-26 20:41:41,598 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1649292.0, ans=0.0 2023-06-26 20:41:41,656 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1649292.0, ans=0.125 2023-06-26 20:41:53,083 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.780e+02 7.996e+02 1.335e+03 1.838e+03 3.332e+03, threshold=2.670e+03, percent-clipped=35.0 2023-06-26 20:41:55,477 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1649352.0, ans=0.2 2023-06-26 20:42:11,281 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-26 20:42:14,154 INFO [train.py:996] (3/4) Epoch 10, batch 450, loss[loss=0.2583, simple_loss=0.3714, pruned_loss=0.07264, over 21721.00 frames. ], tot_loss[loss=0.2116, simple_loss=0.2903, pruned_loss=0.06646, over 3816566.44 frames. ], batch size: 414, lr: 3.01e-03, grad_scale: 16.0 2023-06-26 20:42:23,526 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1649412.0, ans=0.1 2023-06-26 20:43:17,979 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1649532.0, ans=0.0 2023-06-26 20:43:21,226 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1649532.0, ans=0.125 2023-06-26 20:43:44,518 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1649652.0, ans=0.125 2023-06-26 20:43:51,296 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=1649652.0, ans=0.0 2023-06-26 20:43:59,393 INFO [train.py:996] (3/4) Epoch 10, batch 500, loss[loss=0.1814, simple_loss=0.2599, pruned_loss=0.05142, over 21657.00 frames. ], tot_loss[loss=0.2111, simple_loss=0.2908, pruned_loss=0.06566, over 3922648.29 frames. ], batch size: 298, lr: 3.01e-03, grad_scale: 16.0 2023-06-26 20:44:12,142 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1649712.0, ans=0.125 2023-06-26 20:44:29,720 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.57 vs. 
limit=10.0 2023-06-26 20:44:33,527 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten.whitening_limit, batch_count=1649772.0, ans=15.0 2023-06-26 20:45:24,859 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.219e+02 9.001e+02 1.327e+03 2.089e+03 4.282e+03, threshold=2.653e+03, percent-clipped=10.0 2023-06-26 20:45:51,432 INFO [train.py:996] (3/4) Epoch 10, batch 550, loss[loss=0.1762, simple_loss=0.2263, pruned_loss=0.06299, over 19958.00 frames. ], tot_loss[loss=0.2131, simple_loss=0.2942, pruned_loss=0.06597, over 4001202.77 frames. ], batch size: 704, lr: 3.01e-03, grad_scale: 16.0 2023-06-26 20:46:11,939 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.16 vs. limit=6.0 2023-06-26 20:46:14,402 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=1650072.0, ans=0.04949747468305833 2023-06-26 20:46:34,211 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.03 vs. limit=6.0 2023-06-26 20:46:53,115 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-26 20:46:53,682 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.88 vs. limit=22.5 2023-06-26 20:47:23,796 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1650252.0, ans=0.1 2023-06-26 20:47:33,150 INFO [train.py:996] (3/4) Epoch 10, batch 600, loss[loss=0.2498, simple_loss=0.3534, pruned_loss=0.07317, over 21705.00 frames. ], tot_loss[loss=0.2152, simple_loss=0.2975, pruned_loss=0.06646, over 4070131.41 frames. ], batch size: 247, lr: 3.01e-03, grad_scale: 16.0 2023-06-26 20:47:33,587 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1650312.0, ans=0.125 2023-06-26 20:47:36,113 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=8.10 vs. limit=15.0 2023-06-26 20:48:58,833 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.973e+02 6.857e+02 1.039e+03 1.439e+03 2.641e+03, threshold=2.079e+03, percent-clipped=0.0 2023-06-26 20:49:13,545 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1650552.0, ans=0.125 2023-06-26 20:49:19,453 INFO [train.py:996] (3/4) Epoch 10, batch 650, loss[loss=0.2261, simple_loss=0.3043, pruned_loss=0.07399, over 21898.00 frames. ], tot_loss[loss=0.2175, simple_loss=0.3009, pruned_loss=0.0671, over 4115644.18 frames. 
], batch size: 124, lr: 3.01e-03, grad_scale: 16.0 2023-06-26 20:49:25,095 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1650612.0, ans=0.125 2023-06-26 20:49:26,999 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1650612.0, ans=0.0 2023-06-26 20:49:57,785 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1650672.0, ans=0.125 2023-06-26 20:50:27,325 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=1650792.0, ans=0.125 2023-06-26 20:50:35,676 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=1650792.0, ans=0.0 2023-06-26 20:50:54,241 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=1650852.0, ans=0.2 2023-06-26 20:50:56,244 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1650852.0, ans=0.1 2023-06-26 20:51:00,314 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.18 vs. limit=15.0 2023-06-26 20:51:00,881 INFO [train.py:996] (3/4) Epoch 10, batch 700, loss[loss=0.2042, simple_loss=0.272, pruned_loss=0.06826, over 21350.00 frames. ], tot_loss[loss=0.2189, simple_loss=0.3025, pruned_loss=0.06763, over 4153560.50 frames. ], batch size: 159, lr: 3.01e-03, grad_scale: 16.0 2023-06-26 20:51:40,413 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1650972.0, ans=0.1 2023-06-26 20:51:45,436 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=1651032.0, ans=0.125 2023-06-26 20:52:02,930 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=1651032.0, ans=0.125 2023-06-26 20:52:26,594 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.100e+02 6.227e+02 9.890e+02 1.482e+03 2.866e+03, threshold=1.978e+03, percent-clipped=9.0 2023-06-26 20:52:46,287 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=1651212.0, ans=0.2 2023-06-26 20:52:46,956 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2.whitening_limit, batch_count=1651212.0, ans=15.0 2023-06-26 20:52:47,470 INFO [train.py:996] (3/4) Epoch 10, batch 750, loss[loss=0.2661, simple_loss=0.3914, pruned_loss=0.0704, over 19709.00 frames. ], tot_loss[loss=0.2177, simple_loss=0.3003, pruned_loss=0.06751, over 4183859.81 frames. 
], batch size: 702, lr: 3.01e-03, grad_scale: 16.0 2023-06-26 20:52:48,212 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-26 20:52:48,221 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1651212.0, ans=0.125 2023-06-26 20:53:08,309 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1651272.0, ans=0.125 2023-06-26 20:53:09,904 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=1651272.0, ans=0.125 2023-06-26 20:53:18,669 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=1651272.0, ans=0.035 2023-06-26 20:53:32,510 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=1651332.0, ans=0.125 2023-06-26 20:53:42,975 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=1651332.0, ans=0.2 2023-06-26 20:54:07,489 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=1651392.0, ans=0.125 2023-06-26 20:54:11,039 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1651392.0, ans=0.125 2023-06-26 20:54:35,016 INFO [train.py:996] (3/4) Epoch 10, batch 800, loss[loss=0.1755, simple_loss=0.2456, pruned_loss=0.05271, over 21624.00 frames. ], tot_loss[loss=0.2154, simple_loss=0.2954, pruned_loss=0.06772, over 4208031.66 frames. ], batch size: 247, lr: 3.01e-03, grad_scale: 32.0 2023-06-26 20:54:41,400 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1651512.0, ans=0.125 2023-06-26 20:55:28,165 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1651632.0, ans=0.125 2023-06-26 20:55:38,781 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1651632.0, ans=0.125 2023-06-26 20:55:39,308 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.03 vs. limit=15.0 2023-06-26 20:55:41,186 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.98 vs. limit=22.5 2023-06-26 20:56:04,638 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.663e+02 5.824e+02 9.070e+02 1.319e+03 2.505e+03, threshold=1.814e+03, percent-clipped=4.0 2023-06-26 20:56:23,631 INFO [train.py:996] (3/4) Epoch 10, batch 850, loss[loss=0.2013, simple_loss=0.2715, pruned_loss=0.06557, over 21757.00 frames. ], tot_loss[loss=0.2143, simple_loss=0.2924, pruned_loss=0.06813, over 4227895.61 frames. 
], batch size: 247, lr: 3.01e-03, grad_scale: 16.0 2023-06-26 20:56:45,515 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=1651872.0, ans=0.125 2023-06-26 20:57:47,879 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-26 20:57:49,764 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=1651992.0, ans=0.0 2023-06-26 20:57:52,767 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1651992.0, ans=0.0 2023-06-26 20:58:18,436 INFO [train.py:996] (3/4) Epoch 10, batch 900, loss[loss=0.1836, simple_loss=0.2648, pruned_loss=0.05121, over 21057.00 frames. ], tot_loss[loss=0.2136, simple_loss=0.2912, pruned_loss=0.06804, over 4233691.56 frames. ], batch size: 143, lr: 3.01e-03, grad_scale: 16.0 2023-06-26 20:58:23,139 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.59 vs. limit=22.5 2023-06-26 20:58:24,007 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1652112.0, ans=0.125 2023-06-26 20:58:46,934 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1652172.0, ans=0.1 2023-06-26 20:59:09,364 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1652232.0, ans=0.125 2023-06-26 20:59:42,409 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.674e+02 4.955e+02 6.528e+02 1.022e+03 3.124e+03, threshold=1.306e+03, percent-clipped=4.0 2023-06-26 20:59:59,955 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.49 vs. limit=15.0 2023-06-26 21:00:07,560 INFO [train.py:996] (3/4) Epoch 10, batch 950, loss[loss=0.1982, simple_loss=0.2594, pruned_loss=0.06848, over 20342.00 frames. ], tot_loss[loss=0.2134, simple_loss=0.2895, pruned_loss=0.06868, over 4249659.62 frames. ], batch size: 703, lr: 3.01e-03, grad_scale: 16.0 2023-06-26 21:00:11,549 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1652412.0, ans=0.125 2023-06-26 21:00:22,456 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=1652412.0, ans=0.125 2023-06-26 21:00:59,833 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1652532.0, ans=0.125 2023-06-26 21:01:56,961 INFO [train.py:996] (3/4) Epoch 10, batch 1000, loss[loss=0.2398, simple_loss=0.3132, pruned_loss=0.0832, over 21777.00 frames. ], tot_loss[loss=0.2131, simple_loss=0.2883, pruned_loss=0.06896, over 4257791.97 frames. 
], batch size: 441, lr: 3.01e-03, grad_scale: 16.0 2023-06-26 21:02:15,588 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=1652712.0, ans=0.125 2023-06-26 21:02:29,807 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1652772.0, ans=0.1 2023-06-26 21:03:31,697 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.126e+02 7.237e+02 1.217e+03 1.852e+03 3.276e+03, threshold=2.433e+03, percent-clipped=47.0 2023-06-26 21:03:51,504 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1652952.0, ans=0.125 2023-06-26 21:03:56,354 INFO [train.py:996] (3/4) Epoch 10, batch 1050, loss[loss=0.192, simple_loss=0.2774, pruned_loss=0.05328, over 21395.00 frames. ], tot_loss[loss=0.2127, simple_loss=0.2881, pruned_loss=0.06864, over 4267653.45 frames. ], batch size: 194, lr: 3.01e-03, grad_scale: 16.0 2023-06-26 21:05:46,778 INFO [train.py:996] (3/4) Epoch 10, batch 1100, loss[loss=0.2345, simple_loss=0.3122, pruned_loss=0.07846, over 21693.00 frames. ], tot_loss[loss=0.2127, simple_loss=0.2895, pruned_loss=0.06796, over 4276398.66 frames. ], batch size: 298, lr: 3.01e-03, grad_scale: 16.0 2023-06-26 21:06:04,010 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.97 vs. limit=22.5 2023-06-26 21:06:08,518 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=1653372.0, ans=0.2 2023-06-26 21:06:24,009 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.29 vs. limit=15.0 2023-06-26 21:06:44,619 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1653432.0, ans=0.125 2023-06-26 21:06:57,553 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1653492.0, ans=0.0 2023-06-26 21:06:57,998 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.25 vs. limit=22.5 2023-06-26 21:07:14,101 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.610e+02 5.858e+02 8.624e+02 1.218e+03 2.996e+03, threshold=1.725e+03, percent-clipped=2.0 2023-06-26 21:07:25,341 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=1653552.0, ans=0.0 2023-06-26 21:07:38,288 INFO [train.py:996] (3/4) Epoch 10, batch 1150, loss[loss=0.1924, simple_loss=0.2444, pruned_loss=0.07024, over 20709.00 frames. ], tot_loss[loss=0.2107, simple_loss=0.2883, pruned_loss=0.0665, over 4282387.26 frames. ], batch size: 608, lr: 3.01e-03, grad_scale: 16.0 2023-06-26 21:07:58,672 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=1653612.0, ans=0.125 2023-06-26 21:08:06,946 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=4.95 vs. 
limit=15.0 2023-06-26 21:09:03,819 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=1653792.0, ans=0.0 2023-06-26 21:09:04,242 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.28 vs. limit=15.0 2023-06-26 21:09:32,112 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1653852.0, ans=0.1 2023-06-26 21:09:36,655 INFO [train.py:996] (3/4) Epoch 10, batch 1200, loss[loss=0.2346, simple_loss=0.3077, pruned_loss=0.08071, over 21442.00 frames. ], tot_loss[loss=0.2137, simple_loss=0.2918, pruned_loss=0.06779, over 4289394.01 frames. ], batch size: 211, lr: 3.01e-03, grad_scale: 32.0 2023-06-26 21:09:49,999 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1653912.0, ans=0.1 2023-06-26 21:09:56,284 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=1653912.0, ans=0.125 2023-06-26 21:11:00,198 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.823e+02 5.719e+02 8.661e+02 1.239e+03 3.080e+03, threshold=1.732e+03, percent-clipped=10.0 2023-06-26 21:11:10,063 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1654152.0, ans=0.0 2023-06-26 21:11:25,979 INFO [train.py:996] (3/4) Epoch 10, batch 1250, loss[loss=0.2138, simple_loss=0.2875, pruned_loss=0.07009, over 21496.00 frames. ], tot_loss[loss=0.2144, simple_loss=0.2932, pruned_loss=0.06774, over 4293721.23 frames. ], batch size: 548, lr: 3.01e-03, grad_scale: 32.0 2023-06-26 21:12:21,379 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=1654332.0, ans=0.0 2023-06-26 21:12:58,262 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=1654452.0, ans=0.125 2023-06-26 21:13:16,668 INFO [train.py:996] (3/4) Epoch 10, batch 1300, loss[loss=0.2026, simple_loss=0.2854, pruned_loss=0.05991, over 21473.00 frames. ], tot_loss[loss=0.2144, simple_loss=0.2943, pruned_loss=0.06724, over 4286417.31 frames. ], batch size: 131, lr: 3.01e-03, grad_scale: 32.0 2023-06-26 21:14:13,070 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.73 vs. limit=22.5 2023-06-26 21:14:43,549 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.194e+02 7.398e+02 1.015e+03 1.513e+03 3.841e+03, threshold=2.029e+03, percent-clipped=13.0 2023-06-26 21:15:06,107 INFO [train.py:996] (3/4) Epoch 10, batch 1350, loss[loss=0.2432, simple_loss=0.3234, pruned_loss=0.08144, over 21591.00 frames. ], tot_loss[loss=0.216, simple_loss=0.2956, pruned_loss=0.0682, over 4292398.72 frames. ], batch size: 414, lr: 3.01e-03, grad_scale: 16.0 2023-06-26 21:16:25,270 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer_na.min_abs, batch_count=1654992.0, ans=0.02 2023-06-26 21:16:52,061 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1655052.0, ans=0.125 2023-06-26 21:17:00,106 INFO [train.py:996] (3/4) Epoch 10, batch 1400, loss[loss=0.2125, simple_loss=0.2817, pruned_loss=0.07166, over 21416.00 frames. 
], tot_loss[loss=0.2154, simple_loss=0.2941, pruned_loss=0.06835, over 4278663.12 frames. ], batch size: 211, lr: 3.01e-03, grad_scale: 16.0 2023-06-26 21:17:02,633 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=1655112.0, ans=0.025 2023-06-26 21:17:26,913 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.63 vs. limit=15.0 2023-06-26 21:17:59,719 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=1655292.0, ans=0.2 2023-06-26 21:18:12,022 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1655292.0, ans=0.125 2023-06-26 21:18:25,053 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.941e+02 5.863e+02 9.944e+02 1.473e+03 3.016e+03, threshold=1.989e+03, percent-clipped=13.0 2023-06-26 21:18:48,225 INFO [train.py:996] (3/4) Epoch 10, batch 1450, loss[loss=0.2366, simple_loss=0.314, pruned_loss=0.07955, over 21866.00 frames. ], tot_loss[loss=0.218, simple_loss=0.297, pruned_loss=0.06945, over 4281836.16 frames. ], batch size: 316, lr: 3.01e-03, grad_scale: 16.0 2023-06-26 21:18:59,035 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=1655412.0, ans=0.04949747468305833 2023-06-26 21:20:36,873 INFO [train.py:996] (3/4) Epoch 10, batch 1500, loss[loss=0.2032, simple_loss=0.2805, pruned_loss=0.06298, over 21943.00 frames. ], tot_loss[loss=0.2185, simple_loss=0.2966, pruned_loss=0.07024, over 4289554.38 frames. ], batch size: 333, lr: 3.01e-03, grad_scale: 16.0 2023-06-26 21:20:47,783 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=1655712.0, ans=0.125 2023-06-26 21:21:29,979 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1655832.0, ans=0.1 2023-06-26 21:22:03,791 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.679e+02 5.579e+02 7.007e+02 1.027e+03 2.656e+03, threshold=1.401e+03, percent-clipped=4.0 2023-06-26 21:22:29,780 INFO [train.py:996] (3/4) Epoch 10, batch 1550, loss[loss=0.208, simple_loss=0.2845, pruned_loss=0.06576, over 21624.00 frames. ], tot_loss[loss=0.2173, simple_loss=0.2948, pruned_loss=0.0699, over 4295957.07 frames. ], batch size: 298, lr: 3.01e-03, grad_scale: 16.0 2023-06-26 21:23:08,890 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=1656132.0, ans=0.125 2023-06-26 21:24:04,817 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1656252.0, ans=0.125 2023-06-26 21:24:18,728 INFO [train.py:996] (3/4) Epoch 10, batch 1600, loss[loss=0.2192, simple_loss=0.2906, pruned_loss=0.07388, over 21543.00 frames. ], tot_loss[loss=0.2175, simple_loss=0.2946, pruned_loss=0.07016, over 4292871.71 frames. ], batch size: 548, lr: 3.01e-03, grad_scale: 32.0 2023-06-26 21:25:14,527 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.65 vs. 
limit=15.0 2023-06-26 21:25:35,467 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.56 vs. limit=15.0 2023-06-26 21:25:50,454 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.007e+02 6.112e+02 1.058e+03 1.502e+03 3.121e+03, threshold=2.116e+03, percent-clipped=30.0 2023-06-26 21:25:59,710 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1656552.0, ans=0.1 2023-06-26 21:26:07,887 INFO [train.py:996] (3/4) Epoch 10, batch 1650, loss[loss=0.1856, simple_loss=0.2453, pruned_loss=0.06292, over 21283.00 frames. ], tot_loss[loss=0.2146, simple_loss=0.2911, pruned_loss=0.06904, over 4293659.79 frames. ], batch size: 548, lr: 3.01e-03, grad_scale: 32.0 2023-06-26 21:26:26,979 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.24 vs. limit=22.5 2023-06-26 21:27:03,129 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1656732.0, ans=0.0 2023-06-26 21:27:49,647 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1656852.0, ans=0.1 2023-06-26 21:28:04,344 INFO [train.py:996] (3/4) Epoch 10, batch 1700, loss[loss=0.2145, simple_loss=0.2711, pruned_loss=0.07898, over 21256.00 frames. ], tot_loss[loss=0.2155, simple_loss=0.2927, pruned_loss=0.06915, over 4297915.65 frames. ], batch size: 471, lr: 3.01e-03, grad_scale: 16.0 2023-06-26 21:28:40,771 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=1656972.0, ans=0.0 2023-06-26 21:28:42,679 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1656972.0, ans=0.1 2023-06-26 21:29:40,240 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.858e+02 6.519e+02 9.043e+02 1.348e+03 2.914e+03, threshold=1.809e+03, percent-clipped=3.0 2023-06-26 21:29:56,219 INFO [train.py:996] (3/4) Epoch 10, batch 1750, loss[loss=0.2324, simple_loss=0.3086, pruned_loss=0.07812, over 21414.00 frames. ], tot_loss[loss=0.2158, simple_loss=0.2948, pruned_loss=0.06836, over 4290400.09 frames. ], batch size: 211, lr: 3.01e-03, grad_scale: 16.0 2023-06-26 21:31:54,519 INFO [train.py:996] (3/4) Epoch 10, batch 1800, loss[loss=0.1724, simple_loss=0.2528, pruned_loss=0.04601, over 21375.00 frames. ], tot_loss[loss=0.2135, simple_loss=0.2929, pruned_loss=0.06705, over 4280383.70 frames. 
], batch size: 211, lr: 3.01e-03, grad_scale: 8.0 2023-06-26 21:32:20,541 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=1657572.0, ans=0.0 2023-06-26 21:32:29,182 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1657572.0, ans=0.1 2023-06-26 21:32:40,523 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1657632.0, ans=0.125 2023-06-26 21:32:42,430 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-26 21:33:24,906 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.986e+02 5.658e+02 9.190e+02 1.767e+03 4.020e+03, threshold=1.838e+03, percent-clipped=23.0 2023-06-26 21:33:25,600 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1657752.0, ans=0.125 2023-06-26 21:33:29,644 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=2.96 vs. limit=15.0 2023-06-26 21:33:44,349 INFO [train.py:996] (3/4) Epoch 10, batch 1850, loss[loss=0.2067, simple_loss=0.298, pruned_loss=0.0577, over 21247.00 frames. ], tot_loss[loss=0.2137, simple_loss=0.2949, pruned_loss=0.06628, over 4280802.94 frames. ], batch size: 549, lr: 3.01e-03, grad_scale: 8.0 2023-06-26 21:34:28,548 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1657932.0, ans=0.125 2023-06-26 21:34:52,749 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1657992.0, ans=0.125 2023-06-26 21:35:26,461 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.34 vs. limit=15.0 2023-06-26 21:35:27,768 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1658052.0, ans=0.125 2023-06-26 21:35:32,232 INFO [train.py:996] (3/4) Epoch 10, batch 1900, loss[loss=0.2467, simple_loss=0.3441, pruned_loss=0.07466, over 21528.00 frames. ], tot_loss[loss=0.2145, simple_loss=0.2957, pruned_loss=0.06664, over 4287837.90 frames. ], batch size: 471, lr: 3.01e-03, grad_scale: 8.0 2023-06-26 21:36:35,760 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=6.37 vs. limit=15.0 2023-06-26 21:37:08,214 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.968e+02 6.601e+02 8.691e+02 1.330e+03 2.480e+03, threshold=1.738e+03, percent-clipped=9.0 2023-06-26 21:37:21,055 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1658412.0, ans=0.1 2023-06-26 21:37:22,013 INFO [train.py:996] (3/4) Epoch 10, batch 1950, loss[loss=0.2093, simple_loss=0.2727, pruned_loss=0.073, over 21860.00 frames. ], tot_loss[loss=0.2127, simple_loss=0.2917, pruned_loss=0.06681, over 4288944.01 frames. 
], batch size: 107, lr: 3.00e-03, grad_scale: 8.0 2023-06-26 21:37:35,479 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1658412.0, ans=0.1 2023-06-26 21:38:35,404 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1658592.0, ans=0.1 2023-06-26 21:39:06,518 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=1658652.0, ans=0.1 2023-06-26 21:39:11,006 INFO [train.py:996] (3/4) Epoch 10, batch 2000, loss[loss=0.1907, simple_loss=0.2874, pruned_loss=0.04698, over 21618.00 frames. ], tot_loss[loss=0.2079, simple_loss=0.2865, pruned_loss=0.06464, over 4286065.08 frames. ], batch size: 263, lr: 3.00e-03, grad_scale: 16.0 2023-06-26 21:39:20,336 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1658712.0, ans=0.125 2023-06-26 21:40:14,296 INFO [scaling.py:962] (3/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=3.97 vs. limit=5.0 2023-06-26 21:40:17,125 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=1658832.0, ans=0.2 2023-06-26 21:40:26,420 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.35 vs. limit=12.0 2023-06-26 21:40:29,188 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1658892.0, ans=0.125 2023-06-26 21:40:46,016 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.256e+02 7.434e+02 1.051e+03 1.825e+03 4.116e+03, threshold=2.102e+03, percent-clipped=26.0 2023-06-26 21:40:46,762 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1658952.0, ans=0.125 2023-06-26 21:41:00,340 INFO [train.py:996] (3/4) Epoch 10, batch 2050, loss[loss=0.2141, simple_loss=0.3107, pruned_loss=0.05875, over 21704.00 frames. ], tot_loss[loss=0.2099, simple_loss=0.2903, pruned_loss=0.06478, over 4287426.94 frames. ], batch size: 351, lr: 3.00e-03, grad_scale: 16.0 2023-06-26 21:41:18,503 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=1659012.0, ans=0.125 2023-06-26 21:41:57,310 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1659132.0, ans=0.125 2023-06-26 21:42:06,892 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1659192.0, ans=0.0 2023-06-26 21:42:31,544 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1659252.0, ans=0.125 2023-06-26 21:42:31,561 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1659252.0, ans=0.125 2023-06-26 21:42:40,778 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.12 vs. 
limit=15.0 2023-06-26 21:42:41,784 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=1659252.0, ans=0.2 2023-06-26 21:42:53,048 INFO [train.py:996] (3/4) Epoch 10, batch 2100, loss[loss=0.2067, simple_loss=0.2875, pruned_loss=0.06299, over 21734.00 frames. ], tot_loss[loss=0.2148, simple_loss=0.2961, pruned_loss=0.0668, over 4282184.36 frames. ], batch size: 112, lr: 3.00e-03, grad_scale: 16.0 2023-06-26 21:44:22,013 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.975e+02 6.453e+02 1.021e+03 1.329e+03 2.280e+03, threshold=2.042e+03, percent-clipped=5.0 2023-06-26 21:44:41,179 INFO [train.py:996] (3/4) Epoch 10, batch 2150, loss[loss=0.2056, simple_loss=0.2928, pruned_loss=0.05914, over 21180.00 frames. ], tot_loss[loss=0.2165, simple_loss=0.297, pruned_loss=0.06804, over 4272714.77 frames. ], batch size: 159, lr: 3.00e-03, grad_scale: 16.0 2023-06-26 21:45:41,312 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=1659732.0, ans=0.125 2023-06-26 21:46:05,227 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=1659792.0, ans=0.0 2023-06-26 21:46:29,871 INFO [train.py:996] (3/4) Epoch 10, batch 2200, loss[loss=0.1759, simple_loss=0.2581, pruned_loss=0.04687, over 21357.00 frames. ], tot_loss[loss=0.2166, simple_loss=0.2972, pruned_loss=0.06801, over 4269226.42 frames. ], batch size: 194, lr: 3.00e-03, grad_scale: 8.0 2023-06-26 21:46:32,453 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1659912.0, ans=0.0 2023-06-26 21:46:32,463 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1659912.0, ans=0.1 2023-06-26 21:46:57,762 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.47 vs. limit=15.0 2023-06-26 21:47:27,640 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.72 vs. limit=15.0 2023-06-26 21:47:35,937 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1660032.0, ans=0.0 2023-06-26 21:47:37,879 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.17 vs. limit=15.0 2023-06-26 21:48:00,546 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.021e+02 5.718e+02 8.930e+02 1.284e+03 2.710e+03, threshold=1.786e+03, percent-clipped=5.0 2023-06-26 21:48:15,326 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.58 vs. limit=15.0 2023-06-26 21:48:17,690 INFO [train.py:996] (3/4) Epoch 10, batch 2250, loss[loss=0.1936, simple_loss=0.3081, pruned_loss=0.03961, over 20751.00 frames. ], tot_loss[loss=0.2125, simple_loss=0.2937, pruned_loss=0.0656, over 4268340.07 frames. 
], batch size: 608, lr: 3.00e-03, grad_scale: 8.0 2023-06-26 21:48:20,223 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1660212.0, ans=0.0 2023-06-26 21:48:59,085 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=1660272.0, ans=0.125 2023-06-26 21:49:02,295 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1660332.0, ans=0.0 2023-06-26 21:49:41,472 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=1660452.0, ans=0.125 2023-06-26 21:49:41,989 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=7.38 vs. limit=15.0 2023-06-26 21:49:42,080 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.13 vs. limit=22.5 2023-06-26 21:50:04,939 INFO [train.py:996] (3/4) Epoch 10, batch 2300, loss[loss=0.19, simple_loss=0.2662, pruned_loss=0.05684, over 21797.00 frames. ], tot_loss[loss=0.2088, simple_loss=0.2877, pruned_loss=0.06497, over 4276400.90 frames. ], batch size: 98, lr: 3.00e-03, grad_scale: 8.0 2023-06-26 21:50:14,417 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1660512.0, ans=0.125 2023-06-26 21:50:20,956 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=1660512.0, ans=0.07 2023-06-26 21:50:27,018 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.54 vs. limit=15.0 2023-06-26 21:50:43,872 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1660572.0, ans=0.125 2023-06-26 21:50:47,117 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=1660572.0, ans=10.0 2023-06-26 21:51:40,838 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.466e+02 6.340e+02 1.061e+03 1.425e+03 3.450e+03, threshold=2.122e+03, percent-clipped=15.0 2023-06-26 21:51:47,123 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.63 vs. limit=15.0 2023-06-26 21:51:52,953 INFO [train.py:996] (3/4) Epoch 10, batch 2350, loss[loss=0.2415, simple_loss=0.3168, pruned_loss=0.08314, over 21882.00 frames. ], tot_loss[loss=0.2091, simple_loss=0.287, pruned_loss=0.06556, over 4269969.47 frames. ], batch size: 372, lr: 3.00e-03, grad_scale: 8.0 2023-06-26 21:52:11,391 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=1660812.0, ans=0.125 2023-06-26 21:52:49,782 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=1660932.0, ans=0.125 2023-06-26 21:52:51,816 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=1660932.0, ans=0.2 2023-06-26 21:53:24,910 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.19 vs. 
limit=15.0 2023-06-26 21:53:46,715 INFO [train.py:996] (3/4) Epoch 10, batch 2400, loss[loss=0.1816, simple_loss=0.2398, pruned_loss=0.06172, over 21446.00 frames. ], tot_loss[loss=0.2123, simple_loss=0.2889, pruned_loss=0.06781, over 4271167.30 frames. ], batch size: 212, lr: 3.00e-03, grad_scale: 16.0 2023-06-26 21:54:15,803 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=1661172.0, ans=0.0 2023-06-26 21:55:06,775 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=12.09 vs. limit=15.0 2023-06-26 21:55:07,733 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=1661292.0, ans=0.0 2023-06-26 21:55:17,525 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.139e+02 8.857e+02 1.254e+03 1.714e+03 3.828e+03, threshold=2.507e+03, percent-clipped=13.0 2023-06-26 21:55:23,270 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1661352.0, ans=0.125 2023-06-26 21:55:35,105 INFO [train.py:996] (3/4) Epoch 10, batch 2450, loss[loss=0.2083, simple_loss=0.2747, pruned_loss=0.07098, over 21153.00 frames. ], tot_loss[loss=0.2138, simple_loss=0.2909, pruned_loss=0.06831, over 4269035.81 frames. ], batch size: 143, lr: 3.00e-03, grad_scale: 16.0 2023-06-26 21:57:22,908 INFO [train.py:996] (3/4) Epoch 10, batch 2500, loss[loss=0.2063, simple_loss=0.2806, pruned_loss=0.06596, over 21829.00 frames. ], tot_loss[loss=0.2128, simple_loss=0.2892, pruned_loss=0.06817, over 4275826.40 frames. ], batch size: 107, lr: 3.00e-03, grad_scale: 16.0 2023-06-26 21:57:53,087 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=1661772.0, ans=0.2 2023-06-26 21:58:19,809 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.65 vs. limit=15.0 2023-06-26 21:58:52,891 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.949e+02 5.442e+02 7.727e+02 1.360e+03 2.872e+03, threshold=1.545e+03, percent-clipped=3.0 2023-06-26 21:59:16,972 INFO [train.py:996] (3/4) Epoch 10, batch 2550, loss[loss=0.2103, simple_loss=0.3027, pruned_loss=0.05902, over 21092.00 frames. ], tot_loss[loss=0.2122, simple_loss=0.289, pruned_loss=0.06774, over 4273785.80 frames. 
], batch size: 607, lr: 3.00e-03, grad_scale: 16.0 2023-06-26 21:59:57,237 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1662132.0, ans=0.1 2023-06-26 22:00:12,847 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1662192.0, ans=0.125 2023-06-26 22:00:33,512 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=1662192.0, ans=0.125 2023-06-26 22:00:36,783 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1662252.0, ans=0.125 2023-06-26 22:00:36,851 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1662252.0, ans=0.0 2023-06-26 22:00:43,555 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1662252.0, ans=0.0 2023-06-26 22:00:58,653 INFO [train.py:996] (3/4) Epoch 10, batch 2600, loss[loss=0.211, simple_loss=0.3001, pruned_loss=0.06098, over 21523.00 frames. ], tot_loss[loss=0.2129, simple_loss=0.2887, pruned_loss=0.06857, over 4270583.87 frames. ], batch size: 389, lr: 3.00e-03, grad_scale: 16.0 2023-06-26 22:01:41,845 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1662372.0, ans=0.1 2023-06-26 22:01:43,535 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1662432.0, ans=0.0 2023-06-26 22:01:48,557 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=1662432.0, ans=0.2 2023-06-26 22:02:07,732 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1662492.0, ans=0.125 2023-06-26 22:02:30,510 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.653e+02 5.932e+02 7.910e+02 1.183e+03 2.273e+03, threshold=1.582e+03, percent-clipped=10.0 2023-06-26 22:02:31,148 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1662552.0, ans=0.0 2023-06-26 22:02:31,152 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1662552.0, ans=0.0 2023-06-26 22:02:48,659 INFO [train.py:996] (3/4) Epoch 10, batch 2650, loss[loss=0.2142, simple_loss=0.3025, pruned_loss=0.06297, over 21851.00 frames. ], tot_loss[loss=0.2131, simple_loss=0.2888, pruned_loss=0.06875, over 4271253.16 frames. 
], batch size: 371, lr: 3.00e-03, grad_scale: 16.0 2023-06-26 22:03:21,816 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=1662672.0, ans=0.0 2023-06-26 22:03:38,485 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1662732.0, ans=0.0 2023-06-26 22:03:46,836 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=1662732.0, ans=0.0 2023-06-26 22:04:14,958 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1662852.0, ans=0.125 2023-06-26 22:04:43,340 INFO [train.py:996] (3/4) Epoch 10, batch 2700, loss[loss=0.194, simple_loss=0.2671, pruned_loss=0.06044, over 21795.00 frames. ], tot_loss[loss=0.2126, simple_loss=0.288, pruned_loss=0.06861, over 4273973.38 frames. ], batch size: 282, lr: 3.00e-03, grad_scale: 16.0 2023-06-26 22:04:44,006 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1662912.0, ans=0.0 2023-06-26 22:05:01,130 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=1662972.0, ans=0.05 2023-06-26 22:05:27,992 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1663032.0, ans=0.0 2023-06-26 22:05:54,455 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=1663092.0, ans=0.04949747468305833 2023-06-26 22:05:58,104 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=1663092.0, ans=0.0 2023-06-26 22:06:09,124 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.093e+02 5.804e+02 8.533e+02 1.371e+03 2.390e+03, threshold=1.707e+03, percent-clipped=16.0 2023-06-26 22:06:31,057 INFO [train.py:996] (3/4) Epoch 10, batch 2750, loss[loss=0.2138, simple_loss=0.2855, pruned_loss=0.07107, over 21803.00 frames. ], tot_loss[loss=0.2137, simple_loss=0.2899, pruned_loss=0.06875, over 4277234.19 frames. ], batch size: 282, lr: 3.00e-03, grad_scale: 8.0 2023-06-26 22:06:40,793 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=1663212.0, ans=0.0 2023-06-26 22:07:53,077 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-26 22:08:21,189 INFO [train.py:996] (3/4) Epoch 10, batch 2800, loss[loss=0.2225, simple_loss=0.2998, pruned_loss=0.0726, over 21669.00 frames. ], tot_loss[loss=0.2173, simple_loss=0.295, pruned_loss=0.06984, over 4278371.43 frames. ], batch size: 230, lr: 3.00e-03, grad_scale: 16.0 2023-06-26 22:08:44,146 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=5.98 vs. 
limit=15.0 2023-06-26 22:08:59,254 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1663572.0, ans=0.1 2023-06-26 22:09:54,327 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1663752.0, ans=0.125 2023-06-26 22:10:00,637 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.922e+02 7.435e+02 1.264e+03 2.282e+03 6.620e+03, threshold=2.529e+03, percent-clipped=31.0 2023-06-26 22:10:05,422 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=5.21 vs. limit=15.0 2023-06-26 22:10:11,257 INFO [train.py:996] (3/4) Epoch 10, batch 2850, loss[loss=0.2674, simple_loss=0.3385, pruned_loss=0.09813, over 21485.00 frames. ], tot_loss[loss=0.2191, simple_loss=0.2957, pruned_loss=0.07122, over 4276905.17 frames. ], batch size: 508, lr: 3.00e-03, grad_scale: 16.0 2023-06-26 22:10:22,116 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=1663812.0, ans=0.0 2023-06-26 22:10:26,226 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=9.23 vs. limit=15.0 2023-06-26 22:10:41,821 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=5.30 vs. limit=10.0 2023-06-26 22:11:59,687 INFO [train.py:996] (3/4) Epoch 10, batch 2900, loss[loss=0.2071, simple_loss=0.2775, pruned_loss=0.06834, over 21809.00 frames. ], tot_loss[loss=0.2174, simple_loss=0.2933, pruned_loss=0.07073, over 4284452.63 frames. ], batch size: 247, lr: 3.00e-03, grad_scale: 16.0 2023-06-26 22:12:00,382 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1664112.0, ans=0.125 2023-06-26 22:12:33,200 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1664172.0, ans=0.125 2023-06-26 22:12:33,833 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.18 vs. limit=15.0 2023-06-26 22:13:31,841 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=1664352.0, ans=0.1 2023-06-26 22:13:38,074 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.904e+02 5.286e+02 7.202e+02 1.145e+03 2.929e+03, threshold=1.440e+03, percent-clipped=1.0 2023-06-26 22:13:42,305 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1664352.0, ans=0.125 2023-06-26 22:13:46,799 INFO [train.py:996] (3/4) Epoch 10, batch 2950, loss[loss=0.2043, simple_loss=0.2861, pruned_loss=0.06125, over 21289.00 frames. ], tot_loss[loss=0.2177, simple_loss=0.2938, pruned_loss=0.07078, over 4288476.99 frames. 
], batch size: 159, lr: 3.00e-03, grad_scale: 8.0 2023-06-26 22:13:57,712 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1664412.0, ans=0.0 2023-06-26 22:14:22,741 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=1664472.0, ans=0.0 2023-06-26 22:14:45,581 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=1664532.0, ans=0.0 2023-06-26 22:15:09,621 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1664592.0, ans=0.1 2023-06-26 22:15:40,780 INFO [train.py:996] (3/4) Epoch 10, batch 3000, loss[loss=0.2357, simple_loss=0.3152, pruned_loss=0.07814, over 21603.00 frames. ], tot_loss[loss=0.2203, simple_loss=0.2977, pruned_loss=0.07147, over 4288093.47 frames. ], batch size: 230, lr: 3.00e-03, grad_scale: 8.0 2023-06-26 22:15:40,780 INFO [train.py:1019] (3/4) Computing validation loss 2023-06-26 22:15:58,650 INFO [train.py:1028] (3/4) Epoch 10, validation: loss=0.2517, simple_loss=0.3411, pruned_loss=0.08118, over 1796401.00 frames. 2023-06-26 22:15:58,651 INFO [train.py:1029] (3/4) Maximum memory allocated so far is 23690MB 2023-06-26 22:17:05,862 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=1664832.0, ans=0.125 2023-06-26 22:17:09,588 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1664892.0, ans=0.125 2023-06-26 22:17:14,665 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=1664892.0, ans=0.125 2023-06-26 22:17:39,704 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.020e+02 5.823e+02 1.007e+03 1.425e+03 2.943e+03, threshold=2.014e+03, percent-clipped=25.0 2023-06-26 22:17:47,268 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=1665012.0, ans=0.0 2023-06-26 22:17:48,234 INFO [train.py:996] (3/4) Epoch 10, batch 3050, loss[loss=0.1739, simple_loss=0.2469, pruned_loss=0.05046, over 21154.00 frames. ], tot_loss[loss=0.2198, simple_loss=0.299, pruned_loss=0.07026, over 4291133.19 frames. ], batch size: 143, lr: 3.00e-03, grad_scale: 8.0 2023-06-26 22:18:45,380 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=1665132.0, ans=0.0 2023-06-26 22:19:24,737 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1665252.0, ans=0.125 2023-06-26 22:19:34,903 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1665252.0, ans=0.0 2023-06-26 22:19:37,768 INFO [train.py:996] (3/4) Epoch 10, batch 3100, loss[loss=0.1859, simple_loss=0.2777, pruned_loss=0.04707, over 21394.00 frames. ], tot_loss[loss=0.2178, simple_loss=0.298, pruned_loss=0.06885, over 4289712.56 frames. ], batch size: 211, lr: 3.00e-03, grad_scale: 8.0 2023-06-26 22:19:54,652 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=11.97 vs. 
limit=22.5 2023-06-26 22:19:55,606 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1665312.0, ans=0.0 2023-06-26 22:21:17,181 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.635e+02 5.384e+02 7.508e+02 1.175e+03 3.644e+03, threshold=1.502e+03, percent-clipped=4.0 2023-06-26 22:21:26,429 INFO [train.py:996] (3/4) Epoch 10, batch 3150, loss[loss=0.1846, simple_loss=0.2674, pruned_loss=0.05087, over 21652.00 frames. ], tot_loss[loss=0.2182, simple_loss=0.2982, pruned_loss=0.06913, over 4283594.09 frames. ], batch size: 263, lr: 3.00e-03, grad_scale: 8.0 2023-06-26 22:22:22,157 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1665732.0, ans=0.1 2023-06-26 22:23:22,054 INFO [train.py:996] (3/4) Epoch 10, batch 3200, loss[loss=0.1803, simple_loss=0.2619, pruned_loss=0.04941, over 21264.00 frames. ], tot_loss[loss=0.2192, simple_loss=0.2996, pruned_loss=0.06944, over 4281721.85 frames. ], batch size: 176, lr: 3.00e-03, grad_scale: 16.0 2023-06-26 22:23:59,506 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=1665972.0, ans=0.125 2023-06-26 22:24:13,518 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1666032.0, ans=0.0 2023-06-26 22:24:30,399 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-26 22:25:01,069 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.433e+02 6.467e+02 1.041e+03 1.408e+03 2.668e+03, threshold=2.081e+03, percent-clipped=19.0 2023-06-26 22:25:14,947 INFO [train.py:996] (3/4) Epoch 10, batch 3250, loss[loss=0.2279, simple_loss=0.3042, pruned_loss=0.07574, over 21186.00 frames. ], tot_loss[loss=0.2205, simple_loss=0.3, pruned_loss=0.07052, over 4287422.89 frames. ], batch size: 143, lr: 3.00e-03, grad_scale: 16.0 2023-06-26 22:25:17,245 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1666212.0, ans=0.125 2023-06-26 22:26:07,662 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=1666332.0, ans=0.0 2023-06-26 22:26:11,004 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=1666332.0, ans=0.125 2023-06-26 22:26:14,612 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1666332.0, ans=0.125 2023-06-26 22:26:42,985 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1666452.0, ans=0.125 2023-06-26 22:26:45,687 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=12.79 vs. limit=22.5 2023-06-26 22:27:04,048 INFO [train.py:996] (3/4) Epoch 10, batch 3300, loss[loss=0.2401, simple_loss=0.333, pruned_loss=0.07361, over 21309.00 frames. ], tot_loss[loss=0.2188, simple_loss=0.2973, pruned_loss=0.07017, over 4279572.74 frames. 
], batch size: 548, lr: 3.00e-03, grad_scale: 16.0 2023-06-26 22:28:09,055 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=1666692.0, ans=0.125 2023-06-26 22:28:42,867 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.261e+02 7.426e+02 1.088e+03 1.707e+03 4.708e+03, threshold=2.176e+03, percent-clipped=17.0 2023-06-26 22:28:51,835 INFO [train.py:996] (3/4) Epoch 10, batch 3350, loss[loss=0.2212, simple_loss=0.2967, pruned_loss=0.07279, over 21720.00 frames. ], tot_loss[loss=0.2209, simple_loss=0.2997, pruned_loss=0.0711, over 4282597.28 frames. ], batch size: 112, lr: 3.00e-03, grad_scale: 16.0 2023-06-26 22:30:30,050 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1667052.0, ans=0.0 2023-06-26 22:30:33,307 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1667052.0, ans=0.1 2023-06-26 22:30:39,086 INFO [train.py:996] (3/4) Epoch 10, batch 3400, loss[loss=0.2297, simple_loss=0.3052, pruned_loss=0.0771, over 21532.00 frames. ], tot_loss[loss=0.223, simple_loss=0.3019, pruned_loss=0.0721, over 4285533.82 frames. ], batch size: 389, lr: 3.00e-03, grad_scale: 16.0 2023-06-26 22:31:07,841 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=1667172.0, ans=0.125 2023-06-26 22:32:20,103 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.256e+02 6.513e+02 9.750e+02 1.536e+03 3.496e+03, threshold=1.950e+03, percent-clipped=9.0 2023-06-26 22:32:21,510 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=5.57 vs. limit=15.0 2023-06-26 22:32:34,439 INFO [train.py:996] (3/4) Epoch 10, batch 3450, loss[loss=0.1819, simple_loss=0.2591, pruned_loss=0.05231, over 16319.00 frames. ], tot_loss[loss=0.2201, simple_loss=0.2974, pruned_loss=0.07135, over 4274675.65 frames. ], batch size: 62, lr: 3.00e-03, grad_scale: 16.0 2023-06-26 22:32:37,698 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.87 vs. limit=15.0 2023-06-26 22:33:31,875 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.17 vs. limit=22.5 2023-06-26 22:34:03,385 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1667652.0, ans=0.125 2023-06-26 22:34:11,866 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.00 vs. limit=6.0 2023-06-26 22:34:24,144 INFO [train.py:996] (3/4) Epoch 10, batch 3500, loss[loss=0.2452, simple_loss=0.3187, pruned_loss=0.08583, over 21657.00 frames. ], tot_loss[loss=0.2257, simple_loss=0.3037, pruned_loss=0.0738, over 4280279.96 frames. 
], batch size: 263, lr: 3.00e-03, grad_scale: 16.0 2023-06-26 22:35:12,391 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1667832.0, ans=0.1 2023-06-26 22:35:28,246 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=1667832.0, ans=0.5 2023-06-26 22:35:35,126 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=1667892.0, ans=0.0 2023-06-26 22:36:04,394 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.012e+02 7.162e+02 1.009e+03 1.814e+03 3.226e+03, threshold=2.018e+03, percent-clipped=21.0 2023-06-26 22:36:13,108 INFO [train.py:996] (3/4) Epoch 10, batch 3550, loss[loss=0.2232, simple_loss=0.2879, pruned_loss=0.07922, over 21307.00 frames. ], tot_loss[loss=0.2278, simple_loss=0.305, pruned_loss=0.07533, over 4284426.09 frames. ], batch size: 143, lr: 3.00e-03, grad_scale: 16.0 2023-06-26 22:36:20,641 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1668012.0, ans=0.125 2023-06-26 22:37:22,456 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=1668192.0, ans=0.2 2023-06-26 22:37:36,589 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=1668252.0, ans=10.0 2023-06-26 22:38:03,560 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1668252.0, ans=0.125 2023-06-26 22:38:06,111 INFO [train.py:996] (3/4) Epoch 10, batch 3600, loss[loss=0.2349, simple_loss=0.3123, pruned_loss=0.0787, over 21231.00 frames. ], tot_loss[loss=0.2237, simple_loss=0.2987, pruned_loss=0.0743, over 4280546.12 frames. ], batch size: 143, lr: 3.00e-03, grad_scale: 32.0 2023-06-26 22:39:20,305 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1668492.0, ans=0.1 2023-06-26 22:39:42,547 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.944e+02 5.183e+02 6.801e+02 1.024e+03 2.371e+03, threshold=1.360e+03, percent-clipped=4.0 2023-06-26 22:39:54,937 INFO [train.py:996] (3/4) Epoch 10, batch 3650, loss[loss=0.2325, simple_loss=0.2813, pruned_loss=0.09192, over 21240.00 frames. ], tot_loss[loss=0.2229, simple_loss=0.2983, pruned_loss=0.07379, over 4283173.39 frames. ], batch size: 471, lr: 3.00e-03, grad_scale: 16.0 2023-06-26 22:40:04,186 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1668612.0, ans=0.125 2023-06-26 22:40:52,793 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1668732.0, ans=0.1 2023-06-26 22:41:13,545 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.08 vs. 
limit=15.0 2023-06-26 22:41:14,496 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=1668852.0, ans=0.2 2023-06-26 22:41:40,199 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=1668912.0, ans=0.05 2023-06-26 22:41:41,191 INFO [train.py:996] (3/4) Epoch 10, batch 3700, loss[loss=0.2172, simple_loss=0.3125, pruned_loss=0.06092, over 21008.00 frames. ], tot_loss[loss=0.2218, simple_loss=0.298, pruned_loss=0.07284, over 4286886.78 frames. ], batch size: 608, lr: 3.00e-03, grad_scale: 16.0 2023-06-26 22:41:50,267 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1668912.0, ans=0.0 2023-06-26 22:41:55,766 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1668912.0, ans=0.1 2023-06-26 22:43:23,445 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.784e+02 6.221e+02 8.574e+02 1.297e+03 2.866e+03, threshold=1.715e+03, percent-clipped=21.0 2023-06-26 22:43:30,695 INFO [train.py:996] (3/4) Epoch 10, batch 3750, loss[loss=0.2005, simple_loss=0.2844, pruned_loss=0.05831, over 21705.00 frames. ], tot_loss[loss=0.2219, simple_loss=0.2982, pruned_loss=0.07277, over 4286797.71 frames. ], batch size: 389, lr: 3.00e-03, grad_scale: 16.0 2023-06-26 22:43:33,585 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.64 vs. limit=22.5 2023-06-26 22:43:43,688 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1669212.0, ans=0.125 2023-06-26 22:45:07,809 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=1669452.0, ans=0.0 2023-06-26 22:45:18,875 INFO [train.py:996] (3/4) Epoch 10, batch 3800, loss[loss=0.1978, simple_loss=0.268, pruned_loss=0.06382, over 20055.00 frames. ], tot_loss[loss=0.2196, simple_loss=0.2964, pruned_loss=0.07138, over 4279681.58 frames. ], batch size: 703, lr: 2.99e-03, grad_scale: 16.0 2023-06-26 22:46:41,589 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.92 vs. limit=22.5 2023-06-26 22:46:58,175 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.737e+02 5.847e+02 8.030e+02 1.160e+03 2.493e+03, threshold=1.606e+03, percent-clipped=8.0 2023-06-26 22:47:10,254 INFO [train.py:996] (3/4) Epoch 10, batch 3850, loss[loss=0.1855, simple_loss=0.2525, pruned_loss=0.05926, over 21357.00 frames. ], tot_loss[loss=0.2182, simple_loss=0.2933, pruned_loss=0.07155, over 4284915.14 frames. ], batch size: 211, lr: 2.99e-03, grad_scale: 16.0 2023-06-26 22:47:14,440 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1669812.0, ans=0.125 2023-06-26 22:47:53,698 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.29 vs. 
limit=12.0 2023-06-26 22:47:58,390 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1669932.0, ans=0.125 2023-06-26 22:48:08,809 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=1669992.0, ans=0.04949747468305833 2023-06-26 22:48:15,862 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=1669992.0, ans=0.125 2023-06-26 22:48:38,878 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=1670052.0, ans=0.05 2023-06-26 22:48:42,917 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.75 vs. limit=15.0 2023-06-26 22:48:50,901 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=1670112.0, ans=0.95 2023-06-26 22:48:51,939 INFO [train.py:996] (3/4) Epoch 10, batch 3900, loss[loss=0.2167, simple_loss=0.2913, pruned_loss=0.07102, over 21802.00 frames. ], tot_loss[loss=0.216, simple_loss=0.2891, pruned_loss=0.07141, over 4290923.33 frames. ], batch size: 391, lr: 2.99e-03, grad_scale: 16.0 2023-06-26 22:49:38,913 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=1670232.0, ans=0.0 2023-06-26 22:49:38,917 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=1670232.0, ans=0.0 2023-06-26 22:49:42,429 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=1670232.0, ans=0.07 2023-06-26 22:50:28,381 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.min_abs, batch_count=1670352.0, ans=0.5 2023-06-26 22:50:40,110 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.341e+02 6.738e+02 9.125e+02 1.558e+03 3.098e+03, threshold=1.825e+03, percent-clipped=22.0 2023-06-26 22:50:44,518 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1670352.0, ans=0.1 2023-06-26 22:50:47,227 INFO [train.py:996] (3/4) Epoch 10, batch 3950, loss[loss=0.1854, simple_loss=0.2832, pruned_loss=0.0438, over 21615.00 frames. ], tot_loss[loss=0.2162, simple_loss=0.2914, pruned_loss=0.07046, over 4292008.51 frames. 
], batch size: 389, lr: 2.99e-03, grad_scale: 16.0 2023-06-26 22:50:49,479 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1670412.0, ans=0.125 2023-06-26 22:50:53,359 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=1670412.0, ans=0.125 2023-06-26 22:51:16,664 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1670472.0, ans=0.0 2023-06-26 22:51:52,262 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1670532.0, ans=0.125 2023-06-26 22:52:33,447 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=1670652.0, ans=0.125 2023-06-26 22:52:35,951 INFO [train.py:996] (3/4) Epoch 10, batch 4000, loss[loss=0.182, simple_loss=0.2501, pruned_loss=0.05693, over 21554.00 frames. ], tot_loss[loss=0.2096, simple_loss=0.2848, pruned_loss=0.06721, over 4281322.59 frames. ], batch size: 247, lr: 2.99e-03, grad_scale: 32.0 2023-06-26 22:53:41,674 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=1670832.0, ans=0.125 2023-06-26 22:53:57,042 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1670892.0, ans=0.125 2023-06-26 22:54:14,059 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1670952.0, ans=0.0 2023-06-26 22:54:19,969 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.363e+02 6.033e+02 8.423e+02 1.568e+03 3.555e+03, threshold=1.685e+03, percent-clipped=19.0 2023-06-26 22:54:31,309 INFO [train.py:996] (3/4) Epoch 10, batch 4050, loss[loss=0.1803, simple_loss=0.2506, pruned_loss=0.05498, over 16838.00 frames. ], tot_loss[loss=0.2069, simple_loss=0.2828, pruned_loss=0.06547, over 4264504.18 frames. ], batch size: 61, lr: 2.99e-03, grad_scale: 16.0 2023-06-26 22:54:47,271 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=1671072.0, ans=0.2 2023-06-26 22:55:23,282 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.36 vs. limit=15.0 2023-06-26 22:55:40,665 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.85 vs. limit=10.0 2023-06-26 22:55:42,497 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=3.52 vs. limit=12.0 2023-06-26 22:56:03,725 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=10.57 vs. limit=15.0 2023-06-26 22:56:19,373 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1671312.0, ans=0.1 2023-06-26 22:56:20,346 INFO [train.py:996] (3/4) Epoch 10, batch 4100, loss[loss=0.2325, simple_loss=0.323, pruned_loss=0.07105, over 21664.00 frames. ], tot_loss[loss=0.2081, simple_loss=0.2849, pruned_loss=0.06568, over 4276020.10 frames. 
], batch size: 389, lr: 2.99e-03, grad_scale: 16.0 2023-06-26 22:57:07,151 INFO [scaling.py:962] (3/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=4.26 vs. limit=5.0 2023-06-26 22:57:15,177 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.51 vs. limit=22.5 2023-06-26 22:57:35,855 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.10 vs. limit=12.0 2023-06-26 22:57:37,575 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.68 vs. limit=15.0 2023-06-26 22:57:54,916 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=1671552.0, ans=0.04949747468305833 2023-06-26 22:57:57,628 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.779e+02 5.678e+02 9.516e+02 1.395e+03 3.425e+03, threshold=1.903e+03, percent-clipped=17.0 2023-06-26 22:58:02,751 INFO [train.py:996] (3/4) Epoch 10, batch 4150, loss[loss=0.1976, simple_loss=0.2846, pruned_loss=0.05534, over 21720.00 frames. ], tot_loss[loss=0.2073, simple_loss=0.2859, pruned_loss=0.06431, over 4270833.92 frames. ], batch size: 333, lr: 2.99e-03, grad_scale: 16.0 2023-06-26 22:58:13,808 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1671612.0, ans=0.0 2023-06-26 22:58:18,836 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1671672.0, ans=0.125 2023-06-26 22:59:21,901 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=1671792.0, ans=0.05 2023-06-26 22:59:48,064 INFO [train.py:996] (3/4) Epoch 10, batch 4200, loss[loss=0.1729, simple_loss=0.2529, pruned_loss=0.04647, over 21555.00 frames. ], tot_loss[loss=0.2078, simple_loss=0.2869, pruned_loss=0.06441, over 4262025.32 frames. ], batch size: 263, lr: 2.99e-03, grad_scale: 8.0 2023-06-26 23:00:13,840 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1671972.0, ans=0.125 2023-06-26 23:00:31,612 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=1672032.0, ans=0.2 2023-06-26 23:00:33,339 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1672032.0, ans=0.125 2023-06-26 23:01:12,310 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1672092.0, ans=0.125 2023-06-26 23:01:29,526 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.633e+02 4.957e+02 6.956e+02 1.176e+03 3.842e+03, threshold=1.391e+03, percent-clipped=7.0 2023-06-26 23:01:33,256 INFO [train.py:996] (3/4) Epoch 10, batch 4250, loss[loss=0.2139, simple_loss=0.2965, pruned_loss=0.0656, over 21387.00 frames. ], tot_loss[loss=0.2128, simple_loss=0.2926, pruned_loss=0.06649, over 4257900.18 frames. 
], batch size: 211, lr: 2.99e-03, grad_scale: 8.0 2023-06-26 23:02:11,168 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=1672272.0, ans=0.5 2023-06-26 23:03:00,526 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.44 vs. limit=15.0 2023-06-26 23:03:30,179 INFO [train.py:996] (3/4) Epoch 10, batch 4300, loss[loss=0.281, simple_loss=0.3699, pruned_loss=0.09602, over 21468.00 frames. ], tot_loss[loss=0.2168, simple_loss=0.2982, pruned_loss=0.06767, over 4261904.31 frames. ], batch size: 507, lr: 2.99e-03, grad_scale: 8.0 2023-06-26 23:03:46,776 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=1672572.0, ans=0.2 2023-06-26 23:04:06,108 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1672572.0, ans=0.125 2023-06-26 23:04:15,180 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1672632.0, ans=0.125 2023-06-26 23:04:22,152 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1672632.0, ans=0.1 2023-06-26 23:05:12,251 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1672752.0, ans=0.0 2023-06-26 23:05:15,311 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.979e+02 6.207e+02 8.849e+02 1.440e+03 4.327e+03, threshold=1.770e+03, percent-clipped=25.0 2023-06-26 23:05:18,710 INFO [train.py:996] (3/4) Epoch 10, batch 4350, loss[loss=0.1848, simple_loss=0.2536, pruned_loss=0.05798, over 21222.00 frames. ], tot_loss[loss=0.2164, simple_loss=0.2988, pruned_loss=0.06699, over 4261011.30 frames. ], batch size: 144, lr: 2.99e-03, grad_scale: 8.0 2023-06-26 23:05:54,700 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=14.05 vs. limit=15.0 2023-06-26 23:06:12,928 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1672932.0, ans=0.125 2023-06-26 23:06:13,432 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.35 vs. limit=15.0 2023-06-26 23:06:16,727 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=1672932.0, ans=10.0 2023-06-26 23:06:18,387 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1672932.0, ans=0.0 2023-06-26 23:06:23,944 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1672992.0, ans=0.0 2023-06-26 23:07:07,224 INFO [train.py:996] (3/4) Epoch 10, batch 4400, loss[loss=0.2075, simple_loss=0.2986, pruned_loss=0.05818, over 21316.00 frames. ], tot_loss[loss=0.2147, simple_loss=0.2954, pruned_loss=0.06695, over 4261805.18 frames. 
], batch size: 176, lr: 2.99e-03, grad_scale: 16.0 2023-06-26 23:08:52,415 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.985e+02 5.856e+02 8.779e+02 1.198e+03 2.482e+03, threshold=1.756e+03, percent-clipped=8.0 2023-06-26 23:08:56,188 INFO [train.py:996] (3/4) Epoch 10, batch 4450, loss[loss=0.237, simple_loss=0.3249, pruned_loss=0.07456, over 21369.00 frames. ], tot_loss[loss=0.2193, simple_loss=0.3024, pruned_loss=0.06807, over 4256604.64 frames. ], batch size: 194, lr: 2.99e-03, grad_scale: 16.0 2023-06-26 23:10:21,175 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1673592.0, ans=0.125 2023-06-26 23:10:21,185 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=1673592.0, ans=0.2 2023-06-26 23:10:29,978 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1673652.0, ans=0.0 2023-06-26 23:10:45,070 INFO [train.py:996] (3/4) Epoch 10, batch 4500, loss[loss=0.2294, simple_loss=0.3091, pruned_loss=0.07482, over 21878.00 frames. ], tot_loss[loss=0.2212, simple_loss=0.3028, pruned_loss=0.06979, over 4260230.79 frames. ], batch size: 118, lr: 2.99e-03, grad_scale: 16.0 2023-06-26 23:11:11,377 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1673712.0, ans=0.125 2023-06-26 23:11:28,252 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=1673772.0, ans=0.0 2023-06-26 23:11:32,208 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=8.85 vs. limit=10.0 2023-06-26 23:11:38,893 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=1673832.0, ans=0.125 2023-06-26 23:12:31,379 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.557e+02 6.437e+02 9.027e+02 1.407e+03 3.220e+03, threshold=1.805e+03, percent-clipped=13.0 2023-06-26 23:12:46,687 INFO [train.py:996] (3/4) Epoch 10, batch 4550, loss[loss=0.2363, simple_loss=0.3222, pruned_loss=0.07516, over 21751.00 frames. ], tot_loss[loss=0.2246, simple_loss=0.3069, pruned_loss=0.07116, over 4267144.84 frames. ], batch size: 332, lr: 2.99e-03, grad_scale: 16.0 2023-06-26 23:13:41,754 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1674132.0, ans=0.125 2023-06-26 23:14:34,566 INFO [train.py:996] (3/4) Epoch 10, batch 4600, loss[loss=0.3184, simple_loss=0.4451, pruned_loss=0.09583, over 19716.00 frames. ], tot_loss[loss=0.2286, simple_loss=0.3109, pruned_loss=0.07315, over 4271441.66 frames. ], batch size: 702, lr: 2.99e-03, grad_scale: 16.0 2023-06-26 23:15:02,736 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=1674372.0, ans=0.04949747468305833 2023-06-26 23:15:26,339 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.62 vs. 
limit=15.0 2023-06-26 23:15:27,187 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-26 23:15:27,190 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1674432.0, ans=0.0 2023-06-26 23:15:27,224 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=1674432.0, ans=0.0 2023-06-26 23:15:32,826 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=1674432.0, ans=0.125 2023-06-26 23:15:52,161 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.25 vs. limit=12.0 2023-06-26 23:15:55,179 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.56 vs. limit=22.5 2023-06-26 23:16:17,982 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.163e+02 6.181e+02 9.452e+02 1.480e+03 3.323e+03, threshold=1.890e+03, percent-clipped=16.0 2023-06-26 23:16:21,526 INFO [train.py:996] (3/4) Epoch 10, batch 4650, loss[loss=0.1839, simple_loss=0.2601, pruned_loss=0.05383, over 21796.00 frames. ], tot_loss[loss=0.2232, simple_loss=0.3037, pruned_loss=0.0714, over 4285703.98 frames. ], batch size: 298, lr: 2.99e-03, grad_scale: 16.0 2023-06-26 23:17:49,530 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=1674852.0, ans=0.2 2023-06-26 23:18:08,121 INFO [train.py:996] (3/4) Epoch 10, batch 4700, loss[loss=0.2295, simple_loss=0.2736, pruned_loss=0.09267, over 21441.00 frames. ], tot_loss[loss=0.2156, simple_loss=0.2937, pruned_loss=0.06877, over 4272819.55 frames. ], batch size: 509, lr: 2.99e-03, grad_scale: 16.0 2023-06-26 23:18:15,265 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1674912.0, ans=0.125 2023-06-26 23:18:41,605 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=1674972.0, ans=0.0 2023-06-26 23:19:14,843 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1675092.0, ans=0.1 2023-06-26 23:19:50,876 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.765e+02 4.747e+02 5.523e+02 7.889e+02 1.677e+03, threshold=1.105e+03, percent-clipped=0.0 2023-06-26 23:19:54,037 INFO [train.py:996] (3/4) Epoch 10, batch 4750, loss[loss=0.2284, simple_loss=0.3003, pruned_loss=0.07824, over 21765.00 frames. ], tot_loss[loss=0.2135, simple_loss=0.2884, pruned_loss=0.06925, over 4274995.15 frames. ], batch size: 112, lr: 2.99e-03, grad_scale: 16.0 2023-06-26 23:21:09,467 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=1675392.0, ans=0.125 2023-06-26 23:21:41,787 INFO [train.py:996] (3/4) Epoch 10, batch 4800, loss[loss=0.1997, simple_loss=0.2791, pruned_loss=0.06014, over 21269.00 frames. ], tot_loss[loss=0.2139, simple_loss=0.2893, pruned_loss=0.06926, over 4273189.74 frames. 
], batch size: 159, lr: 2.99e-03, grad_scale: 32.0 2023-06-26 23:22:08,394 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1675572.0, ans=0.1 2023-06-26 23:22:29,994 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1675632.0, ans=0.125 2023-06-26 23:22:35,574 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=5.79 vs. limit=15.0 2023-06-26 23:22:36,499 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1675632.0, ans=0.0 2023-06-26 23:23:17,638 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-26 23:23:18,259 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.36 vs. limit=6.0 2023-06-26 23:23:25,455 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.997e+02 5.704e+02 8.592e+02 1.252e+03 2.093e+03, threshold=1.718e+03, percent-clipped=31.0 2023-06-26 23:23:27,163 INFO [train.py:996] (3/4) Epoch 10, batch 4850, loss[loss=0.2115, simple_loss=0.2955, pruned_loss=0.0637, over 20839.00 frames. ], tot_loss[loss=0.2121, simple_loss=0.2876, pruned_loss=0.06829, over 4271776.59 frames. ], batch size: 608, lr: 2.99e-03, grad_scale: 16.0 2023-06-26 23:23:28,326 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=6.75 vs. limit=15.0 2023-06-26 23:23:31,180 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1675812.0, ans=0.125 2023-06-26 23:23:49,995 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1675872.0, ans=0.125 2023-06-26 23:24:01,710 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1675872.0, ans=0.125 2023-06-26 23:24:39,339 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.16 vs. limit=15.0 2023-06-26 23:25:15,488 INFO [train.py:996] (3/4) Epoch 10, batch 4900, loss[loss=0.2048, simple_loss=0.284, pruned_loss=0.06278, over 21602.00 frames. ], tot_loss[loss=0.2135, simple_loss=0.2894, pruned_loss=0.06882, over 4276363.48 frames. 
], batch size: 263, lr: 2.99e-03, grad_scale: 16.0 2023-06-26 23:25:19,596 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1676112.0, ans=0.125 2023-06-26 23:25:27,761 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1676112.0, ans=0.0 2023-06-26 23:25:52,741 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=1676172.0, ans=0.125 2023-06-26 23:26:06,569 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1676232.0, ans=0.125 2023-06-26 23:26:49,116 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-26 23:26:54,108 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1676352.0, ans=0.125 2023-06-26 23:26:54,618 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.50 vs. limit=15.0 2023-06-26 23:27:07,351 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.910e+02 6.746e+02 9.232e+02 1.272e+03 2.922e+03, threshold=1.846e+03, percent-clipped=7.0 2023-06-26 23:27:08,933 INFO [train.py:996] (3/4) Epoch 10, batch 4950, loss[loss=0.1693, simple_loss=0.2429, pruned_loss=0.0478, over 21861.00 frames. ], tot_loss[loss=0.2138, simple_loss=0.2932, pruned_loss=0.06718, over 4272056.03 frames. ], batch size: 107, lr: 2.99e-03, grad_scale: 16.0 2023-06-26 23:27:09,634 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=1676412.0, ans=0.125 2023-06-26 23:27:26,811 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=1676412.0, ans=0.2 2023-06-26 23:27:37,124 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1676472.0, ans=0.0 2023-06-26 23:28:02,605 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=1676532.0, ans=0.2 2023-06-26 23:28:04,332 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1676592.0, ans=0.1 2023-06-26 23:28:44,749 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1676652.0, ans=0.1 2023-06-26 23:28:50,817 INFO [train.py:996] (3/4) Epoch 10, batch 5000, loss[loss=0.195, simple_loss=0.2744, pruned_loss=0.05776, over 21506.00 frames. ], tot_loss[loss=0.2098, simple_loss=0.2916, pruned_loss=0.06403, over 4273283.86 frames. ], batch size: 211, lr: 2.99e-03, grad_scale: 16.0 2023-06-26 23:29:09,422 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=1676712.0, ans=0.95 2023-06-26 23:29:12,711 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=1676712.0, ans=0.125 2023-06-26 23:29:26,525 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=2.83 vs. 
limit=12.0 2023-06-26 23:29:32,735 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1676772.0, ans=0.1 2023-06-26 23:29:48,524 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.51 vs. limit=15.0 2023-06-26 23:30:07,899 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.00 vs. limit=15.0 2023-06-26 23:30:25,071 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.71 vs. limit=15.0 2023-06-26 23:30:35,683 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.540e+02 5.923e+02 8.910e+02 1.386e+03 2.915e+03, threshold=1.782e+03, percent-clipped=9.0 2023-06-26 23:30:37,444 INFO [train.py:996] (3/4) Epoch 10, batch 5050, loss[loss=0.2208, simple_loss=0.2908, pruned_loss=0.07546, over 21883.00 frames. ], tot_loss[loss=0.2114, simple_loss=0.2918, pruned_loss=0.06555, over 4279282.33 frames. ], batch size: 298, lr: 2.99e-03, grad_scale: 16.0 2023-06-26 23:31:06,028 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.98 vs. limit=15.0 2023-06-26 23:31:12,128 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1677072.0, ans=0.1 2023-06-26 23:31:14,219 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=1677072.0, ans=0.125 2023-06-26 23:31:30,306 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=1677132.0, ans=0.0 2023-06-26 23:31:38,572 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=1677192.0, ans=0.07 2023-06-26 23:32:14,379 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1677252.0, ans=0.125 2023-06-26 23:32:22,425 INFO [train.py:996] (3/4) Epoch 10, batch 5100, loss[loss=0.1856, simple_loss=0.2713, pruned_loss=0.04994, over 21781.00 frames. ], tot_loss[loss=0.2119, simple_loss=0.2907, pruned_loss=0.06656, over 4281104.12 frames. ], batch size: 414, lr: 2.99e-03, grad_scale: 16.0 2023-06-26 23:32:23,650 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.51 vs. limit=15.0 2023-06-26 23:32:34,704 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1677312.0, ans=0.125 2023-06-26 23:33:04,167 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1677372.0, ans=0.125 2023-06-26 23:33:12,688 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=1677432.0, ans=0.2 2023-06-26 23:34:07,922 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.053e+02 6.342e+02 8.169e+02 1.053e+03 2.713e+03, threshold=1.634e+03, percent-clipped=6.0 2023-06-26 23:34:09,478 INFO [train.py:996] (3/4) Epoch 10, batch 5150, loss[loss=0.1902, simple_loss=0.264, pruned_loss=0.05821, over 21957.00 frames. 
], tot_loss[loss=0.2118, simple_loss=0.2895, pruned_loss=0.06708, over 4285175.92 frames. ], batch size: 316, lr: 2.99e-03, grad_scale: 16.0 2023-06-26 23:34:10,336 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=1677612.0, ans=0.125 2023-06-26 23:34:30,013 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1677612.0, ans=0.1 2023-06-26 23:35:00,012 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-26 23:35:39,849 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1677852.0, ans=0.125 2023-06-26 23:36:00,868 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=1677852.0, ans=0.2 2023-06-26 23:36:03,521 INFO [train.py:996] (3/4) Epoch 10, batch 5200, loss[loss=0.2737, simple_loss=0.369, pruned_loss=0.08917, over 21515.00 frames. ], tot_loss[loss=0.2136, simple_loss=0.2916, pruned_loss=0.06786, over 4285985.27 frames. ], batch size: 471, lr: 2.99e-03, grad_scale: 32.0 2023-06-26 23:36:31,983 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=1677972.0, ans=0.015 2023-06-26 23:36:39,414 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=1677972.0, ans=0.0 2023-06-26 23:36:44,642 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1678032.0, ans=0.0 2023-06-26 23:36:48,177 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1678032.0, ans=0.125 2023-06-26 23:37:00,999 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.29 vs. limit=15.0 2023-06-26 23:37:02,948 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=7.53 vs. limit=15.0 2023-06-26 23:37:32,731 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1678152.0, ans=0.1 2023-06-26 23:37:34,858 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.43 vs. limit=15.0 2023-06-26 23:37:50,415 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.940e+02 5.817e+02 8.011e+02 1.324e+03 3.418e+03, threshold=1.602e+03, percent-clipped=14.0 2023-06-26 23:37:50,453 INFO [train.py:996] (3/4) Epoch 10, batch 5250, loss[loss=0.2353, simple_loss=0.3208, pruned_loss=0.07495, over 21826.00 frames. ], tot_loss[loss=0.2147, simple_loss=0.2959, pruned_loss=0.06682, over 4280951.58 frames. ], batch size: 371, lr: 2.99e-03, grad_scale: 16.0 2023-06-26 23:37:51,778 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=4.92 vs. 
limit=15.0 2023-06-26 23:38:26,269 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=1678272.0, ans=0.125 2023-06-26 23:39:00,428 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1678392.0, ans=0.125 2023-06-26 23:39:35,325 INFO [train.py:996] (3/4) Epoch 10, batch 5300, loss[loss=0.216, simple_loss=0.2953, pruned_loss=0.0683, over 21875.00 frames. ], tot_loss[loss=0.2148, simple_loss=0.2944, pruned_loss=0.0676, over 4276215.89 frames. ], batch size: 107, lr: 2.99e-03, grad_scale: 16.0 2023-06-26 23:39:39,577 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1678512.0, ans=0.1 2023-06-26 23:40:06,893 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1678572.0, ans=0.125 2023-06-26 23:40:33,004 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.93 vs. limit=22.5 2023-06-26 23:40:52,704 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-26 23:41:10,305 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.82 vs. limit=10.0 2023-06-26 23:41:21,207 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.821e+02 5.421e+02 7.005e+02 9.056e+02 1.380e+03, threshold=1.401e+03, percent-clipped=0.0 2023-06-26 23:41:21,238 INFO [train.py:996] (3/4) Epoch 10, batch 5350, loss[loss=0.2062, simple_loss=0.2824, pruned_loss=0.06504, over 21955.00 frames. ], tot_loss[loss=0.2154, simple_loss=0.2934, pruned_loss=0.06868, over 4276871.75 frames. ], batch size: 333, lr: 2.99e-03, grad_scale: 16.0 2023-06-26 23:41:23,510 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1678812.0, ans=0.125 2023-06-26 23:41:29,128 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.16 vs. limit=22.5 2023-06-26 23:41:36,112 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=6.47 vs. limit=15.0 2023-06-26 23:41:41,862 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=1678872.0, ans=0.0 2023-06-26 23:42:03,896 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1678932.0, ans=0.2 2023-06-26 23:42:18,061 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1678932.0, ans=0.125 2023-06-26 23:42:40,297 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.31 vs. 
limit=15.0 2023-06-26 23:42:54,780 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer_ff3.min_abs, batch_count=1679052.0, ans=0.2 2023-06-26 23:42:54,790 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1679052.0, ans=0.0 2023-06-26 23:43:05,911 INFO [train.py:996] (3/4) Epoch 10, batch 5400, loss[loss=0.2003, simple_loss=0.2737, pruned_loss=0.06339, over 21657.00 frames. ], tot_loss[loss=0.216, simple_loss=0.2923, pruned_loss=0.06984, over 4281684.63 frames. ], batch size: 263, lr: 2.99e-03, grad_scale: 16.0 2023-06-26 23:43:06,973 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.51 vs. limit=15.0 2023-06-26 23:43:31,169 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=1679172.0, ans=0.125 2023-06-26 23:43:36,369 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=1679172.0, ans=0.2 2023-06-26 23:43:42,351 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.79 vs. limit=15.0 2023-06-26 23:44:20,133 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=1679292.0, ans=0.04949747468305833 2023-06-26 23:44:40,703 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=1679352.0, ans=0.0 2023-06-26 23:44:53,961 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.666e+02 6.862e+02 1.175e+03 1.926e+03 4.033e+03, threshold=2.351e+03, percent-clipped=41.0 2023-06-26 23:44:53,992 INFO [train.py:996] (3/4) Epoch 10, batch 5450, loss[loss=0.2492, simple_loss=0.392, pruned_loss=0.05324, over 19788.00 frames. ], tot_loss[loss=0.2158, simple_loss=0.294, pruned_loss=0.06873, over 4283098.40 frames. ], batch size: 702, lr: 2.99e-03, grad_scale: 16.0 2023-06-26 23:46:04,482 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1679532.0, ans=0.125 2023-06-26 23:46:50,758 INFO [train.py:996] (3/4) Epoch 10, batch 5500, loss[loss=0.2105, simple_loss=0.3107, pruned_loss=0.05516, over 21660.00 frames. ], tot_loss[loss=0.2154, simple_loss=0.2992, pruned_loss=0.06581, over 4283662.36 frames. ], batch size: 389, lr: 2.99e-03, grad_scale: 16.0 2023-06-26 23:47:25,469 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.25 vs. limit=15.0 2023-06-26 23:48:05,757 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=1679892.0, ans=0.0 2023-06-26 23:48:24,196 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=1679952.0, ans=0.125 2023-06-26 23:48:31,669 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.29 vs. limit=15.0 2023-06-26 23:48:31,831 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.12 vs. 
limit=6.0 2023-06-26 23:48:34,798 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1679952.0, ans=0.1 2023-06-26 23:48:48,465 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.727e+02 5.357e+02 7.450e+02 1.317e+03 3.051e+03, threshold=1.490e+03, percent-clipped=6.0 2023-06-26 23:48:48,497 INFO [train.py:996] (3/4) Epoch 10, batch 5550, loss[loss=0.1565, simple_loss=0.2517, pruned_loss=0.03062, over 21791.00 frames. ], tot_loss[loss=0.2118, simple_loss=0.2973, pruned_loss=0.06312, over 4272545.29 frames. ], batch size: 316, lr: 2.99e-03, grad_scale: 16.0 2023-06-26 23:49:03,091 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=1680012.0, ans=0.2 2023-06-26 23:49:43,182 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1680132.0, ans=0.125 2023-06-26 23:49:57,018 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.86 vs. limit=22.5 2023-06-26 23:50:10,883 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=1680192.0, ans=10.0 2023-06-26 23:50:38,654 INFO [train.py:996] (3/4) Epoch 10, batch 5600, loss[loss=0.191, simple_loss=0.2764, pruned_loss=0.05281, over 21197.00 frames. ], tot_loss[loss=0.2079, simple_loss=0.2949, pruned_loss=0.06052, over 4277252.21 frames. ], batch size: 176, lr: 2.99e-03, grad_scale: 32.0 2023-06-26 23:50:52,926 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1680312.0, ans=0.125 2023-06-26 23:51:19,484 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=1680432.0, ans=0.0 2023-06-26 23:51:25,942 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1680432.0, ans=0.0 2023-06-26 23:51:52,025 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.56 vs. limit=12.0 2023-06-26 23:52:01,711 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=1680552.0, ans=0.04949747468305833 2023-06-26 23:52:12,172 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=1680552.0, ans=0.125 2023-06-26 23:52:25,069 INFO [train.py:996] (3/4) Epoch 10, batch 5650, loss[loss=0.2064, simple_loss=0.2793, pruned_loss=0.06674, over 21204.00 frames. ], tot_loss[loss=0.2104, simple_loss=0.2972, pruned_loss=0.06178, over 4275330.36 frames. 
], batch size: 607, lr: 2.99e-03, grad_scale: 16.0 2023-06-26 23:52:27,138 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.741e+02 5.468e+02 7.224e+02 1.167e+03 2.877e+03, threshold=1.445e+03, percent-clipped=12.0 2023-06-26 23:52:29,223 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=1680612.0, ans=0.125 2023-06-26 23:52:29,968 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten.whitening_limit, batch_count=1680612.0, ans=15.0 2023-06-26 23:52:48,819 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=1680672.0, ans=0.2 2023-06-26 23:53:06,668 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=7.02 vs. limit=15.0 2023-06-26 23:53:16,214 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1680732.0, ans=0.125 2023-06-26 23:53:30,186 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1680792.0, ans=0.1 2023-06-26 23:54:02,449 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.83 vs. limit=15.0 2023-06-26 23:54:13,502 INFO [train.py:996] (3/4) Epoch 10, batch 5700, loss[loss=0.2163, simple_loss=0.306, pruned_loss=0.06326, over 21802.00 frames. ], tot_loss[loss=0.211, simple_loss=0.2961, pruned_loss=0.06296, over 4284380.06 frames. ], batch size: 351, lr: 2.98e-03, grad_scale: 16.0 2023-06-26 23:56:09,503 INFO [train.py:996] (3/4) Epoch 10, batch 5750, loss[loss=0.174, simple_loss=0.2715, pruned_loss=0.03822, over 21758.00 frames. ], tot_loss[loss=0.2087, simple_loss=0.2947, pruned_loss=0.0614, over 4274841.35 frames. ], batch size: 282, lr: 2.98e-03, grad_scale: 16.0 2023-06-26 23:56:11,423 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.677e+02 6.670e+02 9.043e+02 1.357e+03 3.417e+03, threshold=1.809e+03, percent-clipped=19.0 2023-06-26 23:56:12,814 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.96 vs. limit=10.0 2023-06-26 23:57:58,025 INFO [train.py:996] (3/4) Epoch 10, batch 5800, loss[loss=0.2097, simple_loss=0.3016, pruned_loss=0.0589, over 21724.00 frames. ], tot_loss[loss=0.2077, simple_loss=0.2939, pruned_loss=0.06079, over 4267677.05 frames. ], batch size: 247, lr: 2.98e-03, grad_scale: 16.0 2023-06-26 23:58:17,405 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1681512.0, ans=0.125 2023-06-26 23:58:21,648 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1681572.0, ans=0.1 2023-06-26 23:59:25,875 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1681692.0, ans=0.125 2023-06-26 23:59:40,147 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=1681752.0, ans=0.0 2023-06-26 23:59:46,291 INFO [train.py:996] (3/4) Epoch 10, batch 5850, loss[loss=0.1704, simple_loss=0.2769, pruned_loss=0.03191, over 21767.00 frames. 
], tot_loss[loss=0.2036, simple_loss=0.2923, pruned_loss=0.05748, over 4265277.92 frames. ], batch size: 282, lr: 2.98e-03, grad_scale: 16.0 2023-06-26 23:59:53,514 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.721e+02 4.995e+02 7.881e+02 1.168e+03 2.434e+03, threshold=1.576e+03, percent-clipped=1.0 2023-06-27 00:00:37,426 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1681932.0, ans=0.0 2023-06-27 00:01:07,886 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=1681992.0, ans=0.0 2023-06-27 00:01:36,698 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-27 00:01:37,800 INFO [train.py:996] (3/4) Epoch 10, batch 5900, loss[loss=0.1411, simple_loss=0.2381, pruned_loss=0.02202, over 21732.00 frames. ], tot_loss[loss=0.1947, simple_loss=0.2847, pruned_loss=0.05233, over 4269271.32 frames. ], batch size: 298, lr: 2.98e-03, grad_scale: 16.0 2023-06-27 00:01:38,560 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1682112.0, ans=0.125 2023-06-27 00:02:09,243 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1682172.0, ans=0.1 2023-06-27 00:02:12,926 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=1682172.0, ans=0.125 2023-06-27 00:02:48,988 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.43 vs. limit=10.0 2023-06-27 00:02:57,223 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=1682292.0, ans=0.0 2023-06-27 00:03:14,482 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.74 vs. limit=22.5 2023-06-27 00:03:21,355 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1682352.0, ans=0.125 2023-06-27 00:03:24,110 INFO [train.py:996] (3/4) Epoch 10, batch 5950, loss[loss=0.1849, simple_loss=0.2601, pruned_loss=0.05479, over 21481.00 frames. ], tot_loss[loss=0.1976, simple_loss=0.2836, pruned_loss=0.05583, over 4281509.38 frames. 
], batch size: 211, lr: 2.98e-03, grad_scale: 16.0 2023-06-27 00:03:25,857 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.299e+02 4.862e+02 7.145e+02 9.461e+02 2.592e+03, threshold=1.429e+03, percent-clipped=2.0 2023-06-27 00:04:00,785 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1682472.0, ans=0.125 2023-06-27 00:04:31,118 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1682592.0, ans=0.125 2023-06-27 00:04:39,472 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1682592.0, ans=0.1 2023-06-27 00:04:59,310 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=1682652.0, ans=0.2 2023-06-27 00:05:02,817 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=1682652.0, ans=0.2 2023-06-27 00:05:06,148 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1682652.0, ans=0.125 2023-06-27 00:05:08,655 INFO [train.py:996] (3/4) Epoch 10, batch 6000, loss[loss=0.1748, simple_loss=0.2468, pruned_loss=0.05136, over 21668.00 frames. ], tot_loss[loss=0.1981, simple_loss=0.2794, pruned_loss=0.05838, over 4276902.97 frames. ], batch size: 299, lr: 2.98e-03, grad_scale: 32.0 2023-06-27 00:05:08,656 INFO [train.py:1019] (3/4) Computing validation loss 2023-06-27 00:05:25,662 INFO [zipformer.py:1728] (3/4) name=encoder.encoders.5.encoder.layers.0.self_attn_weights, attn_weights_entropy = tensor([1.9369, 4.4555, 4.6372, 3.6294], device='cuda:3') 2023-06-27 00:05:29,817 INFO [train.py:1028] (3/4) Epoch 10, validation: loss=0.2604, simple_loss=0.3533, pruned_loss=0.08374, over 1796401.00 frames. 2023-06-27 00:05:29,818 INFO [train.py:1029] (3/4) Maximum memory allocated so far is 23690MB 2023-06-27 00:07:18,978 INFO [train.py:996] (3/4) Epoch 10, batch 6050, loss[loss=0.223, simple_loss=0.2766, pruned_loss=0.08474, over 21528.00 frames. ], tot_loss[loss=0.1971, simple_loss=0.2745, pruned_loss=0.05987, over 4275527.85 frames. ], batch size: 442, lr: 2.98e-03, grad_scale: 8.0 2023-06-27 00:07:24,243 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.965e+02 5.435e+02 7.983e+02 1.281e+03 2.662e+03, threshold=1.597e+03, percent-clipped=18.0 2023-06-27 00:07:57,867 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1683132.0, ans=0.125 2023-06-27 00:08:13,262 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1683132.0, ans=0.1 2023-06-27 00:08:53,710 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.07 vs. limit=15.0 2023-06-27 00:09:06,568 INFO [train.py:996] (3/4) Epoch 10, batch 6100, loss[loss=0.2352, simple_loss=0.302, pruned_loss=0.08419, over 21978.00 frames. ], tot_loss[loss=0.1962, simple_loss=0.2744, pruned_loss=0.05902, over 4273900.72 frames. ], batch size: 113, lr: 2.98e-03, grad_scale: 8.0 2023-06-27 00:10:53,278 INFO [train.py:996] (3/4) Epoch 10, batch 6150, loss[loss=0.1923, simple_loss=0.2672, pruned_loss=0.05872, over 21150.00 frames. 
], tot_loss[loss=0.1992, simple_loss=0.2773, pruned_loss=0.06052, over 4276547.26 frames. ], batch size: 143, lr: 2.98e-03, grad_scale: 8.0 2023-06-27 00:10:58,724 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.616e+02 5.589e+02 9.647e+02 1.302e+03 3.090e+03, threshold=1.929e+03, percent-clipped=16.0 2023-06-27 00:11:28,465 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=1683672.0, ans=0.2 2023-06-27 00:11:53,268 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1683732.0, ans=0.125 2023-06-27 00:12:01,243 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=10.89 vs. limit=22.5 2023-06-27 00:12:02,027 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=1683792.0, ans=0.035 2023-06-27 00:12:08,944 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=1683792.0, ans=0.125 2023-06-27 00:12:42,238 INFO [train.py:996] (3/4) Epoch 10, batch 6200, loss[loss=0.1501, simple_loss=0.217, pruned_loss=0.04162, over 17221.00 frames. ], tot_loss[loss=0.2028, simple_loss=0.2819, pruned_loss=0.06187, over 4279648.55 frames. ], batch size: 66, lr: 2.98e-03, grad_scale: 8.0 2023-06-27 00:13:19,906 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1683972.0, ans=0.125 2023-06-27 00:14:31,212 INFO [train.py:996] (3/4) Epoch 10, batch 6250, loss[loss=0.2068, simple_loss=0.3141, pruned_loss=0.04978, over 21843.00 frames. ], tot_loss[loss=0.205, simple_loss=0.2869, pruned_loss=0.06152, over 4286174.14 frames. ], batch size: 371, lr: 2.98e-03, grad_scale: 8.0 2023-06-27 00:14:31,885 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=1684212.0, ans=0.125 2023-06-27 00:14:36,263 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.907e+02 5.995e+02 9.540e+02 1.636e+03 4.135e+03, threshold=1.908e+03, percent-clipped=20.0 2023-06-27 00:15:21,174 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-27 00:16:16,350 INFO [train.py:996] (3/4) Epoch 10, batch 6300, loss[loss=0.206, simple_loss=0.3076, pruned_loss=0.05219, over 20837.00 frames. ], tot_loss[loss=0.2059, simple_loss=0.29, pruned_loss=0.06086, over 4288590.00 frames. 
], batch size: 608, lr: 2.98e-03, grad_scale: 8.0 2023-06-27 00:16:30,952 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=1684512.0, ans=0.125 2023-06-27 00:16:41,010 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1684572.0, ans=0.1 2023-06-27 00:17:07,305 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=1684632.0, ans=0.0 2023-06-27 00:17:31,621 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1684692.0, ans=0.125 2023-06-27 00:17:48,843 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=1684752.0, ans=0.09899494936611666 2023-06-27 00:18:08,445 INFO [train.py:996] (3/4) Epoch 10, batch 6350, loss[loss=0.2162, simple_loss=0.3282, pruned_loss=0.05214, over 20882.00 frames. ], tot_loss[loss=0.2119, simple_loss=0.2939, pruned_loss=0.06491, over 4290735.81 frames. ], batch size: 608, lr: 2.98e-03, grad_scale: 8.0 2023-06-27 00:18:13,741 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.722e+02 5.276e+02 6.494e+02 9.126e+02 1.517e+03, threshold=1.299e+03, percent-clipped=0.0 2023-06-27 00:18:35,793 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1684872.0, ans=0.1 2023-06-27 00:18:47,737 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1684932.0, ans=0.125 2023-06-27 00:19:57,965 INFO [train.py:996] (3/4) Epoch 10, batch 6400, loss[loss=0.2278, simple_loss=0.302, pruned_loss=0.0768, over 21496.00 frames. ], tot_loss[loss=0.2193, simple_loss=0.3002, pruned_loss=0.06915, over 4288121.36 frames. ], batch size: 211, lr: 2.98e-03, grad_scale: 16.0 2023-06-27 00:20:15,089 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.max_abs, batch_count=1685112.0, ans=10.0 2023-06-27 00:20:36,174 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1685172.0, ans=0.125 2023-06-27 00:21:36,859 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=1685352.0, ans=0.2 2023-06-27 00:21:51,033 INFO [train.py:996] (3/4) Epoch 10, batch 6450, loss[loss=0.1924, simple_loss=0.2693, pruned_loss=0.05773, over 21116.00 frames. ], tot_loss[loss=0.2176, simple_loss=0.3006, pruned_loss=0.06729, over 4279877.62 frames. ], batch size: 143, lr: 2.98e-03, grad_scale: 16.0 2023-06-27 00:21:55,952 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.237e+02 6.943e+02 1.024e+03 1.521e+03 2.741e+03, threshold=2.048e+03, percent-clipped=32.0 2023-06-27 00:22:22,425 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=1685472.0, ans=0.125 2023-06-27 00:23:22,612 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=1685652.0, ans=0.0 2023-06-27 00:23:37,623 INFO [train.py:996] (3/4) Epoch 10, batch 6500, loss[loss=0.1874, simple_loss=0.2586, pruned_loss=0.0581, over 21325.00 frames. ], tot_loss[loss=0.2127, simple_loss=0.2935, pruned_loss=0.06589, over 4280853.68 frames. 
], batch size: 159, lr: 2.98e-03, grad_scale: 16.0 2023-06-27 00:23:55,873 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=12.77 vs. limit=22.5 2023-06-27 00:24:08,743 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1685772.0, ans=0.1 2023-06-27 00:24:52,434 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1685892.0, ans=0.125 2023-06-27 00:25:10,388 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.43 vs. limit=22.5 2023-06-27 00:25:23,185 INFO [train.py:996] (3/4) Epoch 10, batch 6550, loss[loss=0.2, simple_loss=0.2791, pruned_loss=0.06045, over 21758.00 frames. ], tot_loss[loss=0.211, simple_loss=0.2915, pruned_loss=0.06524, over 4280528.24 frames. ], batch size: 247, lr: 2.98e-03, grad_scale: 16.0 2023-06-27 00:25:24,460 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.90 vs. limit=15.0 2023-06-27 00:25:28,443 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.027e+02 5.505e+02 8.547e+02 1.330e+03 2.902e+03, threshold=1.709e+03, percent-clipped=6.0 2023-06-27 00:25:29,262 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=1686012.0, ans=0.09899494936611666 2023-06-27 00:25:34,433 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1686012.0, ans=0.125 2023-06-27 00:25:44,977 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=1686072.0, ans=0.5 2023-06-27 00:25:50,523 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=1686072.0, ans=0.0 2023-06-27 00:26:33,575 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.56 vs. limit=22.5 2023-06-27 00:27:05,686 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-27 00:27:10,143 INFO [train.py:996] (3/4) Epoch 10, batch 6600, loss[loss=0.1926, simple_loss=0.2617, pruned_loss=0.06175, over 21776.00 frames. ], tot_loss[loss=0.2093, simple_loss=0.2875, pruned_loss=0.0656, over 4276457.42 frames. ], batch size: 102, lr: 2.98e-03, grad_scale: 16.0 2023-06-27 00:27:48,909 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=1686372.0, ans=0.07 2023-06-27 00:28:19,909 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.10 vs. limit=22.5 2023-06-27 00:28:47,564 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=1686552.0, ans=0.125 2023-06-27 00:28:52,264 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1686552.0, ans=0.1 2023-06-27 00:28:57,097 INFO [train.py:996] (3/4) Epoch 10, batch 6650, loss[loss=0.1813, simple_loss=0.2623, pruned_loss=0.05012, over 21796.00 frames. 
], tot_loss[loss=0.2024, simple_loss=0.2799, pruned_loss=0.06239, over 4276171.27 frames. ], batch size: 317, lr: 2.98e-03, grad_scale: 8.0 2023-06-27 00:29:09,395 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.441e+02 5.556e+02 7.751e+02 1.155e+03 2.381e+03, threshold=1.550e+03, percent-clipped=8.0 2023-06-27 00:29:32,021 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1686672.0, ans=0.0 2023-06-27 00:30:16,533 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-27 00:30:35,726 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=6.29 vs. limit=15.0 2023-06-27 00:30:48,137 INFO [train.py:996] (3/4) Epoch 10, batch 6700, loss[loss=0.1731, simple_loss=0.2415, pruned_loss=0.05237, over 21842.00 frames. ], tot_loss[loss=0.2008, simple_loss=0.2764, pruned_loss=0.06258, over 4278994.87 frames. ], batch size: 107, lr: 2.98e-03, grad_scale: 8.0 2023-06-27 00:31:27,325 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=1687032.0, ans=0.125 2023-06-27 00:31:44,751 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1687032.0, ans=0.1 2023-06-27 00:31:55,113 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1687092.0, ans=0.1 2023-06-27 00:32:00,562 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=10.46 vs. limit=15.0 2023-06-27 00:32:17,512 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=1687152.0, ans=0.2 2023-06-27 00:32:29,084 INFO [train.py:996] (3/4) Epoch 10, batch 6750, loss[loss=0.1923, simple_loss=0.2592, pruned_loss=0.0627, over 21405.00 frames. ], tot_loss[loss=0.2, simple_loss=0.2746, pruned_loss=0.06273, over 4268923.77 frames. ], batch size: 131, lr: 2.98e-03, grad_scale: 8.0 2023-06-27 00:32:41,013 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.839e+02 5.646e+02 8.043e+02 1.106e+03 2.898e+03, threshold=1.609e+03, percent-clipped=7.0 2023-06-27 00:32:49,531 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=1687272.0, ans=0.0 2023-06-27 00:33:25,492 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1687332.0, ans=0.125 2023-06-27 00:33:38,879 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1687392.0, ans=0.0 2023-06-27 00:34:13,855 INFO [train.py:996] (3/4) Epoch 10, batch 6800, loss[loss=0.1847, simple_loss=0.2501, pruned_loss=0.0597, over 21620.00 frames. ], tot_loss[loss=0.2048, simple_loss=0.2776, pruned_loss=0.06594, over 4273376.81 frames. ], batch size: 263, lr: 2.98e-03, grad_scale: 16.0 2023-06-27 00:34:31,701 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.76 vs. 
limit=15.0 2023-06-27 00:34:34,818 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=1687572.0, ans=0.2 2023-06-27 00:35:31,624 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=1687692.0, ans=0.125 2023-06-27 00:35:33,060 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1687692.0, ans=0.0 2023-06-27 00:35:40,194 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=1687752.0, ans=0.2 2023-06-27 00:35:51,757 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.48 vs. limit=22.5 2023-06-27 00:36:00,654 INFO [train.py:996] (3/4) Epoch 10, batch 6850, loss[loss=0.2144, simple_loss=0.2812, pruned_loss=0.07381, over 21279.00 frames. ], tot_loss[loss=0.204, simple_loss=0.2751, pruned_loss=0.06652, over 4271080.23 frames. ], batch size: 176, lr: 2.98e-03, grad_scale: 16.0 2023-06-27 00:36:07,588 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.100e+02 5.578e+02 7.964e+02 1.217e+03 2.059e+03, threshold=1.593e+03, percent-clipped=9.0 2023-06-27 00:36:20,541 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1687812.0, ans=0.0 2023-06-27 00:36:22,379 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1687872.0, ans=0.125 2023-06-27 00:36:23,937 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1687872.0, ans=0.0 2023-06-27 00:37:05,164 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=1687932.0, ans=0.0 2023-06-27 00:37:08,378 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1687992.0, ans=0.125 2023-06-27 00:37:08,861 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.90 vs. limit=15.0 2023-06-27 00:37:47,331 INFO [train.py:996] (3/4) Epoch 10, batch 6900, loss[loss=0.2207, simple_loss=0.2979, pruned_loss=0.07178, over 21803.00 frames. ], tot_loss[loss=0.2045, simple_loss=0.2764, pruned_loss=0.06627, over 4278746.22 frames. ], batch size: 112, lr: 2.98e-03, grad_scale: 16.0 2023-06-27 00:38:18,130 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=12.32 vs. 
limit=15.0 2023-06-27 00:38:46,618 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1688232.0, ans=0.0 2023-06-27 00:38:48,569 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1688232.0, ans=0.125 2023-06-27 00:39:05,806 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=1688292.0, ans=0.07 2023-06-27 00:39:22,720 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=1688352.0, ans=0.04949747468305833 2023-06-27 00:39:33,764 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.01 vs. limit=12.0 2023-06-27 00:39:38,592 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1688352.0, ans=0.0 2023-06-27 00:39:41,213 INFO [train.py:996] (3/4) Epoch 10, batch 6950, loss[loss=0.2134, simple_loss=0.2955, pruned_loss=0.0656, over 21605.00 frames. ], tot_loss[loss=0.2035, simple_loss=0.2792, pruned_loss=0.06394, over 4281523.61 frames. ], batch size: 230, lr: 2.98e-03, grad_scale: 16.0 2023-06-27 00:39:47,978 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.023e+02 6.673e+02 8.913e+02 1.216e+03 2.486e+03, threshold=1.783e+03, percent-clipped=9.0 2023-06-27 00:39:48,910 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1688412.0, ans=0.125 2023-06-27 00:40:18,061 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1688472.0, ans=0.1 2023-06-27 00:40:43,952 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=1688592.0, ans=0.035 2023-06-27 00:40:52,830 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=1688592.0, ans=0.2 2023-06-27 00:40:54,319 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=1688592.0, ans=0.125 2023-06-27 00:40:54,379 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1688592.0, ans=0.125 2023-06-27 00:41:18,527 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=1688652.0, ans=0.125 2023-06-27 00:41:28,527 INFO [train.py:996] (3/4) Epoch 10, batch 7000, loss[loss=0.1895, simple_loss=0.2632, pruned_loss=0.05791, over 21809.00 frames. ], tot_loss[loss=0.2085, simple_loss=0.2832, pruned_loss=0.06687, over 4285106.67 frames. 
], batch size: 118, lr: 2.98e-03, grad_scale: 16.0 2023-06-27 00:41:31,105 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1688712.0, ans=0.125 2023-06-27 00:41:36,068 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=1688712.0, ans=0.2 2023-06-27 00:41:51,302 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1688772.0, ans=0.0 2023-06-27 00:42:50,500 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.12 vs. limit=6.0 2023-06-27 00:43:15,461 INFO [train.py:996] (3/4) Epoch 10, batch 7050, loss[loss=0.2039, simple_loss=0.2956, pruned_loss=0.05603, over 21646.00 frames. ], tot_loss[loss=0.2052, simple_loss=0.2798, pruned_loss=0.06527, over 4281694.61 frames. ], batch size: 441, lr: 2.98e-03, grad_scale: 16.0 2023-06-27 00:43:19,914 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=1689012.0, ans=0.2 2023-06-27 00:43:27,752 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.782e+02 6.761e+02 1.057e+03 1.502e+03 3.144e+03, threshold=2.115e+03, percent-clipped=16.0 2023-06-27 00:43:44,553 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=1689072.0, ans=0.125 2023-06-27 00:43:48,038 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=1689072.0, ans=0.125 2023-06-27 00:44:14,096 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=1689132.0, ans=0.125 2023-06-27 00:45:09,743 INFO [train.py:996] (3/4) Epoch 10, batch 7100, loss[loss=0.1549, simple_loss=0.2325, pruned_loss=0.03865, over 16554.00 frames. ], tot_loss[loss=0.2101, simple_loss=0.2862, pruned_loss=0.06695, over 4272955.22 frames. ], batch size: 61, lr: 2.98e-03, grad_scale: 16.0 2023-06-27 00:45:11,088 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.40 vs. limit=15.0 2023-06-27 00:45:26,333 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1689312.0, ans=0.125 2023-06-27 00:47:02,327 INFO [train.py:996] (3/4) Epoch 10, batch 7150, loss[loss=0.2351, simple_loss=0.3121, pruned_loss=0.07905, over 21634.00 frames. ], tot_loss[loss=0.2086, simple_loss=0.2852, pruned_loss=0.06596, over 4274091.04 frames. 
], batch size: 389, lr: 2.98e-03, grad_scale: 16.0 2023-06-27 00:47:04,661 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1689612.0, ans=0.125 2023-06-27 00:47:09,228 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.861e+02 6.064e+02 8.725e+02 1.357e+03 2.823e+03, threshold=1.745e+03, percent-clipped=6.0 2023-06-27 00:47:14,825 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=1689612.0, ans=0.0 2023-06-27 00:47:20,013 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1689672.0, ans=0.125 2023-06-27 00:48:08,559 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1689792.0, ans=0.0 2023-06-27 00:48:27,325 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1689852.0, ans=0.125 2023-06-27 00:48:48,717 INFO [train.py:996] (3/4) Epoch 10, batch 7200, loss[loss=0.2257, simple_loss=0.278, pruned_loss=0.08669, over 21364.00 frames. ], tot_loss[loss=0.2127, simple_loss=0.2884, pruned_loss=0.06848, over 4265907.49 frames. ], batch size: 473, lr: 2.98e-03, grad_scale: 32.0 2023-06-27 00:49:08,584 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1689972.0, ans=0.0 2023-06-27 00:49:18,865 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=1689972.0, ans=0.125 2023-06-27 00:49:25,712 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=1689972.0, ans=0.125 2023-06-27 00:49:25,721 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=1689972.0, ans=0.0 2023-06-27 00:50:02,989 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=1690092.0, ans=0.125 2023-06-27 00:50:34,337 INFO [train.py:996] (3/4) Epoch 10, batch 7250, loss[loss=0.1879, simple_loss=0.2667, pruned_loss=0.05458, over 21743.00 frames. ], tot_loss[loss=0.2097, simple_loss=0.2838, pruned_loss=0.06784, over 4270500.70 frames. ], batch size: 112, lr: 2.98e-03, grad_scale: 32.0 2023-06-27 00:50:40,768 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.886e+02 6.230e+02 8.378e+02 1.198e+03 2.214e+03, threshold=1.676e+03, percent-clipped=4.0 2023-06-27 00:50:48,486 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=1690212.0, ans=0.2 2023-06-27 00:51:02,187 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=1690272.0, ans=0.125 2023-06-27 00:51:07,062 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=1690272.0, ans=0.125 2023-06-27 00:51:30,231 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=1690332.0, ans=0.125 2023-06-27 00:51:54,424 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=9.96 vs. 
limit=15.0 2023-06-27 00:52:11,134 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1690452.0, ans=0.125 2023-06-27 00:52:18,835 INFO [train.py:996] (3/4) Epoch 10, batch 7300, loss[loss=0.1845, simple_loss=0.2523, pruned_loss=0.05839, over 21656.00 frames. ], tot_loss[loss=0.2054, simple_loss=0.2776, pruned_loss=0.06662, over 4269240.39 frames. ], batch size: 333, lr: 2.98e-03, grad_scale: 16.0 2023-06-27 00:52:24,401 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1690512.0, ans=0.125 2023-06-27 00:52:31,765 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1690512.0, ans=0.125 2023-06-27 00:52:50,563 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=1690572.0, ans=0.125 2023-06-27 00:52:53,735 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1690572.0, ans=0.125 2023-06-27 00:52:57,335 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1690632.0, ans=0.0 2023-06-27 00:53:13,096 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1690632.0, ans=0.125 2023-06-27 00:54:05,630 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1690812.0, ans=0.1 2023-06-27 00:54:06,772 INFO [train.py:996] (3/4) Epoch 10, batch 7350, loss[loss=0.2176, simple_loss=0.2902, pruned_loss=0.07248, over 21303.00 frames. ], tot_loss[loss=0.2054, simple_loss=0.2766, pruned_loss=0.06707, over 4272492.41 frames. ], batch size: 176, lr: 2.98e-03, grad_scale: 16.0 2023-06-27 00:54:15,734 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.980e+02 5.910e+02 7.871e+02 1.338e+03 3.655e+03, threshold=1.574e+03, percent-clipped=15.0 2023-06-27 00:54:20,324 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.63 vs. limit=22.5 2023-06-27 00:54:53,432 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1690932.0, ans=0.125 2023-06-27 00:55:13,961 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1690932.0, ans=0.125 2023-06-27 00:55:39,940 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1691052.0, ans=0.125 2023-06-27 00:55:56,541 INFO [train.py:996] (3/4) Epoch 10, batch 7400, loss[loss=0.2057, simple_loss=0.3027, pruned_loss=0.0543, over 21624.00 frames. ], tot_loss[loss=0.2098, simple_loss=0.2821, pruned_loss=0.06876, over 4274702.08 frames. ], batch size: 389, lr: 2.98e-03, grad_scale: 16.0 2023-06-27 00:56:09,693 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=1691112.0, ans=0.0 2023-06-27 00:56:13,572 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=8.14 vs. 
limit=15.0 2023-06-27 00:57:08,608 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=1691292.0, ans=0.0 2023-06-27 00:57:20,421 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1691292.0, ans=0.1 2023-06-27 00:57:22,450 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=1691292.0, ans=0.95 2023-06-27 00:57:32,843 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1691352.0, ans=0.125 2023-06-27 00:57:32,885 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1691352.0, ans=0.1 2023-06-27 00:57:39,534 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-27 00:57:42,487 INFO [train.py:996] (3/4) Epoch 10, batch 7450, loss[loss=0.1924, simple_loss=0.2631, pruned_loss=0.06085, over 21826.00 frames. ], tot_loss[loss=0.2082, simple_loss=0.2812, pruned_loss=0.06755, over 4275753.70 frames. ], batch size: 352, lr: 2.98e-03, grad_scale: 16.0 2023-06-27 00:57:46,794 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1691412.0, ans=0.0 2023-06-27 00:57:56,770 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.982e+02 5.896e+02 9.357e+02 1.491e+03 2.777e+03, threshold=1.871e+03, percent-clipped=18.0 2023-06-27 00:58:06,769 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1691412.0, ans=0.125 2023-06-27 00:59:13,608 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1691652.0, ans=0.125 2023-06-27 00:59:37,955 INFO [train.py:996] (3/4) Epoch 10, batch 7500, loss[loss=0.2247, simple_loss=0.3236, pruned_loss=0.0629, over 21444.00 frames. ], tot_loss[loss=0.2119, simple_loss=0.2871, pruned_loss=0.06837, over 4276788.93 frames. ], batch size: 211, lr: 2.98e-03, grad_scale: 16.0 2023-06-27 00:59:49,066 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1691712.0, ans=0.1 2023-06-27 01:00:37,966 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1691832.0, ans=0.1 2023-06-27 01:00:56,984 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1691892.0, ans=0.125 2023-06-27 01:01:31,383 INFO [train.py:996] (3/4) Epoch 10, batch 7550, loss[loss=0.1847, simple_loss=0.2406, pruned_loss=0.06442, over 20239.00 frames. ], tot_loss[loss=0.2147, simple_loss=0.2937, pruned_loss=0.06785, over 4273392.02 frames. ], batch size: 702, lr: 2.98e-03, grad_scale: 16.0 2023-06-27 01:01:39,828 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.177e+02 6.369e+02 9.874e+02 1.839e+03 3.635e+03, threshold=1.975e+03, percent-clipped=22.0 2023-06-27 01:01:50,940 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.71 vs. 
limit=15.0 2023-06-27 01:02:39,533 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-27 01:02:44,844 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1692192.0, ans=0.1 2023-06-27 01:03:11,969 INFO [train.py:996] (3/4) Epoch 10, batch 7600, loss[loss=0.2226, simple_loss=0.2914, pruned_loss=0.07691, over 21801.00 frames. ], tot_loss[loss=0.2128, simple_loss=0.2912, pruned_loss=0.06717, over 4272987.39 frames. ], batch size: 441, lr: 2.97e-03, grad_scale: 32.0 2023-06-27 01:03:12,461 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1692312.0, ans=0.0 2023-06-27 01:03:24,014 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=10.56 vs. limit=15.0 2023-06-27 01:03:30,329 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.62 vs. limit=10.0 2023-06-27 01:03:37,435 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=6.26 vs. limit=22.5 2023-06-27 01:04:11,135 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1692432.0, ans=0.125 2023-06-27 01:04:12,658 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=1692432.0, ans=0.125 2023-06-27 01:05:03,909 INFO [train.py:996] (3/4) Epoch 10, batch 7650, loss[loss=0.2291, simple_loss=0.3032, pruned_loss=0.07752, over 21891.00 frames. ], tot_loss[loss=0.2131, simple_loss=0.2903, pruned_loss=0.06791, over 4278739.27 frames. ], batch size: 124, lr: 2.97e-03, grad_scale: 32.0 2023-06-27 01:05:12,459 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.938e+02 5.695e+02 7.737e+02 9.992e+02 2.893e+03, threshold=1.547e+03, percent-clipped=4.0 2023-06-27 01:05:19,853 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1692672.0, ans=0.125 2023-06-27 01:05:47,688 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=1692732.0, ans=0.07 2023-06-27 01:06:14,061 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1692792.0, ans=0.125 2023-06-27 01:06:52,694 INFO [train.py:996] (3/4) Epoch 10, batch 7700, loss[loss=0.2171, simple_loss=0.2907, pruned_loss=0.07171, over 21613.00 frames. ], tot_loss[loss=0.2161, simple_loss=0.2921, pruned_loss=0.07006, over 4279458.97 frames. ], batch size: 263, lr: 2.97e-03, grad_scale: 16.0 2023-06-27 01:07:29,153 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1692972.0, ans=0.125 2023-06-27 01:07:45,363 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1693032.0, ans=0.0 2023-06-27 01:07:45,796 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.74 vs. 
limit=15.0 2023-06-27 01:07:56,580 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.89 vs. limit=6.0 2023-06-27 01:08:01,668 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=1693092.0, ans=0.2 2023-06-27 01:08:24,987 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1693152.0, ans=0.125 2023-06-27 01:08:32,512 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.33 vs. limit=12.0 2023-06-27 01:08:37,340 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=1693152.0, ans=0.2 2023-06-27 01:08:43,850 INFO [train.py:996] (3/4) Epoch 10, batch 7750, loss[loss=0.2198, simple_loss=0.3127, pruned_loss=0.06348, over 21260.00 frames. ], tot_loss[loss=0.2183, simple_loss=0.2971, pruned_loss=0.06976, over 4277393.12 frames. ], batch size: 176, lr: 2.97e-03, grad_scale: 16.0 2023-06-27 01:09:05,027 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.135e+02 8.248e+02 1.279e+03 1.795e+03 4.947e+03, threshold=2.557e+03, percent-clipped=28.0 2023-06-27 01:09:12,460 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1693272.0, ans=0.1 2023-06-27 01:09:39,296 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=8.55 vs. limit=15.0 2023-06-27 01:10:42,229 INFO [train.py:996] (3/4) Epoch 10, batch 7800, loss[loss=0.2265, simple_loss=0.3105, pruned_loss=0.07127, over 21836.00 frames. ], tot_loss[loss=0.2214, simple_loss=0.3007, pruned_loss=0.07106, over 4281588.05 frames. ], batch size: 372, lr: 2.97e-03, grad_scale: 16.0 2023-06-27 01:10:52,940 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=1693512.0, ans=0.2 2023-06-27 01:11:41,293 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=1693692.0, ans=0.2 2023-06-27 01:11:51,380 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1693692.0, ans=0.125 2023-06-27 01:11:53,619 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.93 vs. limit=22.5 2023-06-27 01:12:04,655 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1693752.0, ans=0.1 2023-06-27 01:12:12,622 INFO [train.py:996] (3/4) Epoch 10, batch 7850, loss[loss=0.1906, simple_loss=0.256, pruned_loss=0.06257, over 21487.00 frames. ], tot_loss[loss=0.2161, simple_loss=0.2929, pruned_loss=0.06959, over 4276226.19 frames. ], batch size: 195, lr: 2.97e-03, grad_scale: 16.0 2023-06-27 01:12:32,522 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.059e+02 5.917e+02 8.514e+02 1.468e+03 3.815e+03, threshold=1.703e+03, percent-clipped=5.0 2023-06-27 01:12:58,700 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.36 vs. 
limit=6.0 2023-06-27 01:13:37,781 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1693992.0, ans=0.125 2023-06-27 01:13:48,707 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1694052.0, ans=0.0 2023-06-27 01:14:08,068 INFO [train.py:996] (3/4) Epoch 10, batch 7900, loss[loss=0.2063, simple_loss=0.2874, pruned_loss=0.0626, over 21573.00 frames. ], tot_loss[loss=0.2114, simple_loss=0.2867, pruned_loss=0.068, over 4264563.02 frames. ], batch size: 230, lr: 2.97e-03, grad_scale: 8.0 2023-06-27 01:14:40,456 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.32 vs. limit=22.5 2023-06-27 01:14:48,752 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=1694232.0, ans=0.125 2023-06-27 01:15:09,858 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.62 vs. limit=22.5 2023-06-27 01:15:39,265 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-27 01:15:39,786 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys.whitening_limit, batch_count=1694352.0, ans=6.0 2023-06-27 01:16:03,725 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1694412.0, ans=0.1 2023-06-27 01:16:04,747 INFO [train.py:996] (3/4) Epoch 10, batch 7950, loss[loss=0.2651, simple_loss=0.3393, pruned_loss=0.09549, over 21519.00 frames. ], tot_loss[loss=0.2129, simple_loss=0.2912, pruned_loss=0.06728, over 4262194.64 frames. ], batch size: 507, lr: 2.97e-03, grad_scale: 8.0 2023-06-27 01:16:15,999 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=1694412.0, ans=0.125 2023-06-27 01:16:16,926 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.966e+02 5.576e+02 7.742e+02 1.234e+03 3.670e+03, threshold=1.548e+03, percent-clipped=16.0 2023-06-27 01:16:38,837 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=1694472.0, ans=0.2 2023-06-27 01:16:50,322 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=1694532.0, ans=0.0 2023-06-27 01:16:50,350 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=1694532.0, ans=0.125 2023-06-27 01:16:52,463 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.95 vs. limit=15.0 2023-06-27 01:17:08,311 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1694592.0, ans=0.125 2023-06-27 01:17:55,663 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=8.90 vs. limit=15.0 2023-06-27 01:17:56,262 INFO [train.py:996] (3/4) Epoch 10, batch 8000, loss[loss=0.3069, simple_loss=0.3816, pruned_loss=0.1161, over 21400.00 frames. 
], tot_loss[loss=0.2177, simple_loss=0.2963, pruned_loss=0.06955, over 4264281.24 frames. ], batch size: 507, lr: 2.97e-03, grad_scale: 16.0 2023-06-27 01:18:26,100 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=1694772.0, ans=0.0 2023-06-27 01:18:42,761 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys.whitening_limit, batch_count=1694772.0, ans=6.0 2023-06-27 01:19:07,255 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-27 01:20:02,336 INFO [train.py:996] (3/4) Epoch 10, batch 8050, loss[loss=0.1893, simple_loss=0.2465, pruned_loss=0.06602, over 21248.00 frames. ], tot_loss[loss=0.2185, simple_loss=0.2982, pruned_loss=0.06935, over 4263568.08 frames. ], batch size: 143, lr: 2.97e-03, grad_scale: 16.0 2023-06-27 01:20:06,690 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=1695012.0, ans=0.0 2023-06-27 01:20:14,618 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.526e+02 7.082e+02 1.044e+03 1.392e+03 2.627e+03, threshold=2.088e+03, percent-clipped=20.0 2023-06-27 01:20:41,813 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.44 vs. limit=22.5 2023-06-27 01:20:49,343 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.whiten.whitening_limit, batch_count=1695132.0, ans=12.0 2023-06-27 01:20:59,275 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1695132.0, ans=0.125 2023-06-27 01:21:20,768 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.94 vs. limit=12.0 2023-06-27 01:21:51,392 INFO [train.py:996] (3/4) Epoch 10, batch 8100, loss[loss=0.1826, simple_loss=0.24, pruned_loss=0.06264, over 20009.00 frames. ], tot_loss[loss=0.218, simple_loss=0.2964, pruned_loss=0.06979, over 4270705.23 frames. ], batch size: 703, lr: 2.97e-03, grad_scale: 16.0 2023-06-27 01:21:58,402 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.03 vs. limit=15.0 2023-06-27 01:22:09,149 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=1695372.0, ans=0.05 2023-06-27 01:22:34,852 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1695372.0, ans=0.125 2023-06-27 01:22:50,078 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.47 vs. limit=15.0 2023-06-27 01:23:37,712 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=8.52 vs. limit=15.0 2023-06-27 01:23:50,267 INFO [train.py:996] (3/4) Epoch 10, batch 8150, loss[loss=0.2077, simple_loss=0.2571, pruned_loss=0.0791, over 20283.00 frames. ], tot_loss[loss=0.2249, simple_loss=0.3048, pruned_loss=0.07244, over 4265757.33 frames. 
], batch size: 703, lr: 2.97e-03, grad_scale: 16.0 2023-06-27 01:23:51,768 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.46 vs. limit=15.0 2023-06-27 01:24:03,761 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=1695612.0, ans=10.0 2023-06-27 01:24:07,913 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.023e+02 5.816e+02 8.551e+02 1.587e+03 5.169e+03, threshold=1.710e+03, percent-clipped=17.0 2023-06-27 01:24:11,777 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=1695672.0, ans=0.2 2023-06-27 01:24:39,767 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1695732.0, ans=0.1 2023-06-27 01:24:40,324 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.94 vs. limit=15.0 2023-06-27 01:24:42,945 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1695732.0, ans=0.125 2023-06-27 01:25:38,329 INFO [train.py:996] (3/4) Epoch 10, batch 8200, loss[loss=0.1803, simple_loss=0.244, pruned_loss=0.05829, over 21609.00 frames. ], tot_loss[loss=0.2169, simple_loss=0.2947, pruned_loss=0.06957, over 4252744.56 frames. ], batch size: 231, lr: 2.97e-03, grad_scale: 16.0 2023-06-27 01:26:11,463 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1695972.0, ans=0.125 2023-06-27 01:27:10,717 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1696152.0, ans=0.125 2023-06-27 01:27:32,701 INFO [train.py:996] (3/4) Epoch 10, batch 8250, loss[loss=0.2349, simple_loss=0.3659, pruned_loss=0.05197, over 20760.00 frames. ], tot_loss[loss=0.2164, simple_loss=0.2946, pruned_loss=0.06912, over 4258725.75 frames. ], batch size: 607, lr: 2.97e-03, grad_scale: 16.0 2023-06-27 01:27:36,918 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1696212.0, ans=0.0 2023-06-27 01:27:44,588 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.725e+02 5.485e+02 7.641e+02 1.335e+03 2.771e+03, threshold=1.528e+03, percent-clipped=11.0 2023-06-27 01:28:02,709 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1696272.0, ans=0.0 2023-06-27 01:29:21,567 INFO [train.py:996] (3/4) Epoch 10, batch 8300, loss[loss=0.1912, simple_loss=0.2854, pruned_loss=0.04847, over 21712.00 frames. ], tot_loss[loss=0.2138, simple_loss=0.2946, pruned_loss=0.06651, over 4262627.24 frames. ], batch size: 298, lr: 2.97e-03, grad_scale: 16.0 2023-06-27 01:29:34,407 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1696512.0, ans=0.1 2023-06-27 01:30:38,861 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=5.94 vs. limit=15.0 2023-06-27 01:31:11,524 INFO [train.py:996] (3/4) Epoch 10, batch 8350, loss[loss=0.1907, simple_loss=0.269, pruned_loss=0.05616, over 21451.00 frames. 
], tot_loss[loss=0.2123, simple_loss=0.2943, pruned_loss=0.06512, over 4264806.20 frames. ], batch size: 212, lr: 2.97e-03, grad_scale: 16.0 2023-06-27 01:31:12,290 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1696812.0, ans=0.1 2023-06-27 01:31:15,952 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.59 vs. limit=12.0 2023-06-27 01:31:21,324 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.62 vs. limit=22.5 2023-06-27 01:31:23,480 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.611e+02 5.774e+02 7.528e+02 1.140e+03 3.100e+03, threshold=1.506e+03, percent-clipped=11.0 2023-06-27 01:33:01,162 INFO [train.py:996] (3/4) Epoch 10, batch 8400, loss[loss=0.2428, simple_loss=0.3304, pruned_loss=0.07761, over 21502.00 frames. ], tot_loss[loss=0.2103, simple_loss=0.2931, pruned_loss=0.06375, over 4265142.60 frames. ], batch size: 508, lr: 2.97e-03, grad_scale: 32.0 2023-06-27 01:33:19,354 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1697172.0, ans=0.125 2023-06-27 01:33:23,085 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=1697172.0, ans=0.2 2023-06-27 01:34:06,096 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=1697292.0, ans=0.2 2023-06-27 01:34:23,866 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1697292.0, ans=0.125 2023-06-27 01:34:25,436 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=1697352.0, ans=0.0 2023-06-27 01:34:47,767 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1697412.0, ans=0.125 2023-06-27 01:34:48,808 INFO [train.py:996] (3/4) Epoch 10, batch 8450, loss[loss=0.1984, simple_loss=0.2754, pruned_loss=0.06071, over 21828.00 frames. ], tot_loss[loss=0.2096, simple_loss=0.2928, pruned_loss=0.06323, over 4276079.06 frames. 
], batch size: 124, lr: 2.97e-03, grad_scale: 16.0 2023-06-27 01:35:02,444 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.213e+02 7.215e+02 1.072e+03 1.642e+03 3.949e+03, threshold=2.143e+03, percent-clipped=30.0 2023-06-27 01:35:04,698 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1697472.0, ans=0.125 2023-06-27 01:35:09,810 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1697472.0, ans=0.125 2023-06-27 01:35:42,678 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1697532.0, ans=0.125 2023-06-27 01:35:58,541 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-27 01:36:29,566 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1697652.0, ans=0.125 2023-06-27 01:36:36,939 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=1697712.0, ans=0.125 2023-06-27 01:36:37,976 INFO [train.py:996] (3/4) Epoch 10, batch 8500, loss[loss=0.1915, simple_loss=0.2537, pruned_loss=0.06463, over 21477.00 frames. ], tot_loss[loss=0.2094, simple_loss=0.2897, pruned_loss=0.06458, over 4275901.93 frames. ], batch size: 212, lr: 2.97e-03, grad_scale: 16.0 2023-06-27 01:36:40,927 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.35 vs. limit=15.0 2023-06-27 01:36:51,268 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1697712.0, ans=0.1 2023-06-27 01:36:58,872 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=4.97 vs. limit=15.0 2023-06-27 01:37:04,129 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=9.61 vs. limit=22.5 2023-06-27 01:38:04,429 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1697952.0, ans=0.125 2023-06-27 01:38:24,067 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.86 vs. limit=15.0 2023-06-27 01:38:28,156 INFO [train.py:996] (3/4) Epoch 10, batch 8550, loss[loss=0.2065, simple_loss=0.3, pruned_loss=0.05652, over 21799.00 frames. ], tot_loss[loss=0.2127, simple_loss=0.2923, pruned_loss=0.06657, over 4271304.46 frames. ], batch size: 282, lr: 2.97e-03, grad_scale: 16.0 2023-06-27 01:38:37,565 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1698012.0, ans=0.125 2023-06-27 01:38:41,899 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.182e+02 6.171e+02 1.011e+03 1.607e+03 3.555e+03, threshold=2.023e+03, percent-clipped=12.0 2023-06-27 01:39:10,051 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.99 vs. 
limit=15.0 2023-06-27 01:39:21,810 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=1698132.0, ans=0.2 2023-06-27 01:40:17,115 INFO [train.py:996] (3/4) Epoch 10, batch 8600, loss[loss=0.2303, simple_loss=0.313, pruned_loss=0.07382, over 21820.00 frames. ], tot_loss[loss=0.2181, simple_loss=0.2985, pruned_loss=0.06883, over 4275526.61 frames. ], batch size: 282, lr: 2.97e-03, grad_scale: 16.0 2023-06-27 01:41:23,327 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=1698432.0, ans=0.2 2023-06-27 01:41:38,598 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1698492.0, ans=0.125 2023-06-27 01:41:42,707 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.44 vs. limit=15.0 2023-06-27 01:42:05,294 INFO [train.py:996] (3/4) Epoch 10, batch 8650, loss[loss=0.2159, simple_loss=0.3158, pruned_loss=0.05796, over 21627.00 frames. ], tot_loss[loss=0.2218, simple_loss=0.3043, pruned_loss=0.06962, over 4275978.64 frames. ], batch size: 441, lr: 2.97e-03, grad_scale: 16.0 2023-06-27 01:42:24,803 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.451e+02 5.765e+02 7.630e+02 1.183e+03 2.009e+03, threshold=1.526e+03, percent-clipped=0.0 2023-06-27 01:42:38,355 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=18.11 vs. limit=22.5 2023-06-27 01:42:47,786 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=1698672.0, ans=0.0 2023-06-27 01:42:51,072 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=1698732.0, ans=0.07 2023-06-27 01:42:52,574 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-27 01:43:09,702 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1698732.0, ans=0.0 2023-06-27 01:43:50,334 INFO [train.py:996] (3/4) Epoch 10, batch 8700, loss[loss=0.2076, simple_loss=0.2666, pruned_loss=0.07426, over 21578.00 frames. ], tot_loss[loss=0.2141, simple_loss=0.2952, pruned_loss=0.06646, over 4273267.40 frames. ], batch size: 414, lr: 2.97e-03, grad_scale: 16.0 2023-06-27 01:44:46,022 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1699032.0, ans=0.0 2023-06-27 01:45:32,676 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1699152.0, ans=0.0 2023-06-27 01:45:34,484 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=1699152.0, ans=0.0 2023-06-27 01:45:39,134 INFO [train.py:996] (3/4) Epoch 10, batch 8750, loss[loss=0.2076, simple_loss=0.2772, pruned_loss=0.06899, over 21835.00 frames. ], tot_loss[loss=0.2113, simple_loss=0.2893, pruned_loss=0.06668, over 4281005.31 frames. 
], batch size: 391, lr: 2.97e-03, grad_scale: 16.0 2023-06-27 01:45:39,870 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=1699212.0, ans=0.125 2023-06-27 01:45:58,739 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.80 vs. limit=10.0 2023-06-27 01:45:59,221 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.645e+02 6.087e+02 8.152e+02 1.140e+03 2.309e+03, threshold=1.630e+03, percent-clipped=11.0 2023-06-27 01:46:00,707 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten.whitening_limit, batch_count=1699212.0, ans=15.0 2023-06-27 01:46:01,938 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=1699272.0, ans=0.125 2023-06-27 01:46:25,876 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.59 vs. limit=22.5 2023-06-27 01:47:16,167 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1699452.0, ans=0.125 2023-06-27 01:47:34,981 INFO [train.py:996] (3/4) Epoch 10, batch 8800, loss[loss=0.2676, simple_loss=0.3543, pruned_loss=0.09048, over 21610.00 frames. ], tot_loss[loss=0.2196, simple_loss=0.2992, pruned_loss=0.07003, over 4279007.31 frames. ], batch size: 389, lr: 2.97e-03, grad_scale: 32.0 2023-06-27 01:48:40,314 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=1699632.0, ans=0.125 2023-06-27 01:48:45,444 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=1699692.0, ans=0.0 2023-06-27 01:49:24,244 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.79 vs. limit=15.0 2023-06-27 01:49:33,048 INFO [train.py:996] (3/4) Epoch 10, batch 8850, loss[loss=0.2061, simple_loss=0.2799, pruned_loss=0.06614, over 21404.00 frames. ], tot_loss[loss=0.2237, simple_loss=0.3052, pruned_loss=0.07112, over 4279247.88 frames. ], batch size: 194, lr: 2.97e-03, grad_scale: 16.0 2023-06-27 01:49:48,550 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.063e+02 5.642e+02 7.591e+02 1.245e+03 2.739e+03, threshold=1.518e+03, percent-clipped=14.0 2023-06-27 01:51:01,089 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1700052.0, ans=0.125 2023-06-27 01:51:22,888 INFO [train.py:996] (3/4) Epoch 10, batch 8900, loss[loss=0.2142, simple_loss=0.2863, pruned_loss=0.07102, over 22042.00 frames. ], tot_loss[loss=0.2198, simple_loss=0.2994, pruned_loss=0.07005, over 4279978.25 frames. ], batch size: 103, lr: 2.97e-03, grad_scale: 16.0 2023-06-27 01:51:48,546 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.68 vs. limit=10.0 2023-06-27 01:53:20,379 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1700412.0, ans=0.1 2023-06-27 01:53:21,315 INFO [train.py:996] (3/4) Epoch 10, batch 8950, loss[loss=0.1988, simple_loss=0.2666, pruned_loss=0.06553, over 21479.00 frames. 
], tot_loss[loss=0.2199, simple_loss=0.3005, pruned_loss=0.06963, over 4271483.98 frames. ], batch size: 195, lr: 2.97e-03, grad_scale: 16.0 2023-06-27 01:53:42,449 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.709e+02 6.064e+02 9.607e+02 1.976e+03 3.801e+03, threshold=1.921e+03, percent-clipped=34.0 2023-06-27 01:53:43,160 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=1700472.0, ans=0.125 2023-06-27 01:53:53,458 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=1700472.0, ans=0.2 2023-06-27 01:53:55,303 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1700472.0, ans=0.0 2023-06-27 01:54:01,944 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=1700532.0, ans=0.125 2023-06-27 01:54:48,260 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=1700652.0, ans=0.0 2023-06-27 01:55:09,679 INFO [train.py:996] (3/4) Epoch 10, batch 9000, loss[loss=0.2076, simple_loss=0.2825, pruned_loss=0.06632, over 21664.00 frames. ], tot_loss[loss=0.2174, simple_loss=0.2962, pruned_loss=0.06924, over 4275487.12 frames. ], batch size: 298, lr: 2.97e-03, grad_scale: 16.0 2023-06-27 01:55:09,680 INFO [train.py:1019] (3/4) Computing validation loss 2023-06-27 01:55:27,999 INFO [train.py:1028] (3/4) Epoch 10, validation: loss=0.2678, simple_loss=0.3533, pruned_loss=0.09113, over 1796401.00 frames. 2023-06-27 01:55:28,000 INFO [train.py:1029] (3/4) Maximum memory allocated so far is 23690MB 2023-06-27 01:56:36,559 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1700892.0, ans=0.125 2023-06-27 01:56:58,068 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=9.10 vs. limit=10.0 2023-06-27 01:57:02,658 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=1700952.0, ans=0.0 2023-06-27 01:57:03,292 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=6.57 vs. limit=15.0 2023-06-27 01:57:23,157 INFO [train.py:996] (3/4) Epoch 10, batch 9050, loss[loss=0.2888, simple_loss=0.3435, pruned_loss=0.1171, over 21403.00 frames. ], tot_loss[loss=0.212, simple_loss=0.291, pruned_loss=0.06647, over 4275699.64 frames. 
], batch size: 509, lr: 2.97e-03, grad_scale: 16.0 2023-06-27 01:57:34,615 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1701012.0, ans=0.1 2023-06-27 01:57:45,754 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.665e+02 7.496e+02 1.289e+03 1.830e+03 3.310e+03, threshold=2.578e+03, percent-clipped=22.0 2023-06-27 01:57:47,988 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1701072.0, ans=0.125 2023-06-27 01:57:58,383 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1701072.0, ans=0.125 2023-06-27 01:58:18,386 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1701132.0, ans=0.1 2023-06-27 01:59:13,803 INFO [train.py:996] (3/4) Epoch 10, batch 9100, loss[loss=0.254, simple_loss=0.3417, pruned_loss=0.08317, over 21660.00 frames. ], tot_loss[loss=0.2152, simple_loss=0.2938, pruned_loss=0.06831, over 4271607.01 frames. ], batch size: 441, lr: 2.97e-03, grad_scale: 16.0 2023-06-27 01:59:14,532 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1701312.0, ans=0.125 2023-06-27 01:59:27,389 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=1701312.0, ans=0.125 2023-06-27 02:01:09,259 INFO [train.py:996] (3/4) Epoch 10, batch 9150, loss[loss=0.2365, simple_loss=0.3343, pruned_loss=0.06939, over 21655.00 frames. ], tot_loss[loss=0.2163, simple_loss=0.2988, pruned_loss=0.06685, over 4269502.21 frames. ], batch size: 389, lr: 2.97e-03, grad_scale: 16.0 2023-06-27 02:01:11,517 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1701612.0, ans=0.125 2023-06-27 02:01:11,525 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1701612.0, ans=0.0 2023-06-27 02:01:24,807 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.482e+02 5.209e+02 7.364e+02 1.147e+03 3.350e+03, threshold=1.473e+03, percent-clipped=3.0 2023-06-27 02:01:54,932 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=1701732.0, ans=0.0 2023-06-27 02:02:04,323 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.75 vs. limit=10.0 2023-06-27 02:02:49,906 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=9.15 vs. limit=15.0 2023-06-27 02:02:58,995 INFO [train.py:996] (3/4) Epoch 10, batch 9200, loss[loss=0.2268, simple_loss=0.3015, pruned_loss=0.07606, over 21818.00 frames. ], tot_loss[loss=0.217, simple_loss=0.3013, pruned_loss=0.06632, over 4265985.56 frames. ], batch size: 118, lr: 2.97e-03, grad_scale: 32.0 2023-06-27 02:03:16,565 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1701972.0, ans=0.125 2023-06-27 02:03:36,241 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.81 vs. 
limit=15.0 2023-06-27 02:04:45,290 INFO [train.py:996] (3/4) Epoch 10, batch 9250, loss[loss=0.2113, simple_loss=0.277, pruned_loss=0.07276, over 21982.00 frames. ], tot_loss[loss=0.2208, simple_loss=0.3035, pruned_loss=0.06908, over 4268979.20 frames. ], batch size: 103, lr: 2.97e-03, grad_scale: 16.0 2023-06-27 02:04:52,997 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1702212.0, ans=0.1 2023-06-27 02:05:02,704 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.922e+02 6.299e+02 8.423e+02 1.393e+03 3.022e+03, threshold=1.685e+03, percent-clipped=24.0 2023-06-27 02:05:22,951 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=1702272.0, ans=0.125 2023-06-27 02:06:35,087 INFO [train.py:996] (3/4) Epoch 10, batch 9300, loss[loss=0.188, simple_loss=0.2656, pruned_loss=0.0552, over 21559.00 frames. ], tot_loss[loss=0.2166, simple_loss=0.2971, pruned_loss=0.06811, over 4270311.82 frames. ], batch size: 247, lr: 2.97e-03, grad_scale: 16.0 2023-06-27 02:06:37,362 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=1702512.0, ans=0.0 2023-06-27 02:06:53,599 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1702512.0, ans=0.125 2023-06-27 02:07:52,067 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.94 vs. limit=12.0 2023-06-27 02:08:02,279 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1702752.0, ans=0.125 2023-06-27 02:08:19,215 INFO [train.py:996] (3/4) Epoch 10, batch 9350, loss[loss=0.2416, simple_loss=0.3301, pruned_loss=0.07649, over 21434.00 frames. ], tot_loss[loss=0.2205, simple_loss=0.3034, pruned_loss=0.06875, over 4269665.90 frames. ], batch size: 131, lr: 2.97e-03, grad_scale: 16.0 2023-06-27 02:08:34,050 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-27 02:08:47,198 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.895e+02 6.669e+02 9.528e+02 1.719e+03 4.361e+03, threshold=1.906e+03, percent-clipped=26.0 2023-06-27 02:08:51,673 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1702872.0, ans=0.0 2023-06-27 02:09:14,435 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1702932.0, ans=0.125 2023-06-27 02:09:51,484 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1703052.0, ans=0.0 2023-06-27 02:10:18,893 INFO [train.py:996] (3/4) Epoch 10, batch 9400, loss[loss=0.1832, simple_loss=0.2543, pruned_loss=0.05607, over 21556.00 frames. ], tot_loss[loss=0.2238, simple_loss=0.3077, pruned_loss=0.07, over 4269906.55 frames. ], batch size: 195, lr: 2.97e-03, grad_scale: 16.0 2023-06-27 02:10:19,494 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=1703112.0, ans=0.04949747468305833 2023-06-27 02:10:59,713 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=10.61 vs. 
limit=15.0 2023-06-27 02:11:07,936 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=1703232.0, ans=0.2 2023-06-27 02:11:26,414 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1703292.0, ans=0.0 2023-06-27 02:11:32,937 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.min_positive, batch_count=1703292.0, ans=0.025 2023-06-27 02:11:46,976 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1703352.0, ans=0.0 2023-06-27 02:12:02,889 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.06 vs. limit=15.0 2023-06-27 02:12:05,127 INFO [train.py:996] (3/4) Epoch 10, batch 9450, loss[loss=0.2029, simple_loss=0.2628, pruned_loss=0.07155, over 21451.00 frames. ], tot_loss[loss=0.2181, simple_loss=0.2988, pruned_loss=0.06874, over 4266813.31 frames. ], batch size: 441, lr: 2.97e-03, grad_scale: 16.0 2023-06-27 02:12:10,138 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten.whitening_limit, batch_count=1703412.0, ans=15.0 2023-06-27 02:12:21,242 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=1703472.0, ans=0.125 2023-06-27 02:12:22,341 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.110e+02 5.502e+02 7.576e+02 1.129e+03 2.324e+03, threshold=1.515e+03, percent-clipped=5.0 2023-06-27 02:12:40,293 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=1703472.0, ans=0.125 2023-06-27 02:13:32,782 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1703652.0, ans=0.125 2023-06-27 02:13:52,556 INFO [train.py:996] (3/4) Epoch 10, batch 9500, loss[loss=0.1642, simple_loss=0.2563, pruned_loss=0.03604, over 21623.00 frames. ], tot_loss[loss=0.2126, simple_loss=0.2911, pruned_loss=0.06707, over 4273954.55 frames. ], batch size: 263, lr: 2.96e-03, grad_scale: 16.0 2023-06-27 02:14:43,168 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=1703832.0, ans=0.0 2023-06-27 02:14:51,807 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1703832.0, ans=0.1 2023-06-27 02:15:30,896 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=1703952.0, ans=0.0 2023-06-27 02:15:42,677 INFO [train.py:996] (3/4) Epoch 10, batch 9550, loss[loss=0.2401, simple_loss=0.3141, pruned_loss=0.08309, over 21948.00 frames. ], tot_loss[loss=0.2165, simple_loss=0.2952, pruned_loss=0.06896, over 4275491.10 frames. 
], batch size: 372, lr: 2.96e-03, grad_scale: 16.0 2023-06-27 02:15:53,560 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1704012.0, ans=0.125 2023-06-27 02:16:03,759 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1704072.0, ans=0.125 2023-06-27 02:16:04,745 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.287e+02 6.617e+02 9.297e+02 1.429e+03 3.226e+03, threshold=1.859e+03, percent-clipped=22.0 2023-06-27 02:16:13,248 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=10.69 vs. limit=15.0 2023-06-27 02:16:52,240 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=1704192.0, ans=0.0 2023-06-27 02:17:29,878 INFO [train.py:996] (3/4) Epoch 10, batch 9600, loss[loss=0.1946, simple_loss=0.2776, pruned_loss=0.05581, over 21854.00 frames. ], tot_loss[loss=0.2204, simple_loss=0.2989, pruned_loss=0.07096, over 4282180.15 frames. ], batch size: 332, lr: 2.96e-03, grad_scale: 32.0 2023-06-27 02:18:24,825 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.07 vs. limit=10.0 2023-06-27 02:18:26,363 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=1704432.0, ans=0.0 2023-06-27 02:18:30,358 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.97 vs. limit=12.0 2023-06-27 02:19:26,582 INFO [train.py:996] (3/4) Epoch 10, batch 9650, loss[loss=0.2484, simple_loss=0.3163, pruned_loss=0.09023, over 21630.00 frames. ], tot_loss[loss=0.2203, simple_loss=0.2985, pruned_loss=0.071, over 4288017.62 frames. ], batch size: 263, lr: 2.96e-03, grad_scale: 32.0 2023-06-27 02:19:45,810 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.795e+02 6.257e+02 8.564e+02 1.301e+03 2.812e+03, threshold=1.713e+03, percent-clipped=7.0 2023-06-27 02:20:47,024 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten.whitening_limit, batch_count=1704792.0, ans=15.0 2023-06-27 02:20:50,413 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.75 vs. limit=15.0 2023-06-27 02:21:02,184 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1704852.0, ans=0.125 2023-06-27 02:21:15,566 INFO [train.py:996] (3/4) Epoch 10, batch 9700, loss[loss=0.1872, simple_loss=0.2698, pruned_loss=0.05233, over 21602.00 frames. ], tot_loss[loss=0.2232, simple_loss=0.3031, pruned_loss=0.0717, over 4285662.22 frames. ], batch size: 230, lr: 2.96e-03, grad_scale: 16.0 2023-06-27 02:21:39,471 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1704972.0, ans=0.125 2023-06-27 02:23:03,762 INFO [train.py:996] (3/4) Epoch 10, batch 9750, loss[loss=0.234, simple_loss=0.3251, pruned_loss=0.07147, over 21829.00 frames. ], tot_loss[loss=0.2181, simple_loss=0.2962, pruned_loss=0.07004, over 4281818.52 frames. 
], batch size: 107, lr: 2.96e-03, grad_scale: 16.0 2023-06-27 02:23:26,621 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=1705272.0, ans=0.2 2023-06-27 02:23:27,953 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.191e+02 6.700e+02 1.068e+03 1.546e+03 3.673e+03, threshold=2.135e+03, percent-clipped=19.0 2023-06-27 02:23:44,951 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=13.15 vs. limit=22.5 2023-06-27 02:24:39,279 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1705452.0, ans=0.125 2023-06-27 02:24:45,099 INFO [train.py:996] (3/4) Epoch 10, batch 9800, loss[loss=0.205, simple_loss=0.2767, pruned_loss=0.06667, over 21796.00 frames. ], tot_loss[loss=0.2187, simple_loss=0.2962, pruned_loss=0.07059, over 4280610.62 frames. ], batch size: 247, lr: 2.96e-03, grad_scale: 16.0 2023-06-27 02:24:57,624 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1705512.0, ans=0.125 2023-06-27 02:26:38,260 INFO [train.py:996] (3/4) Epoch 10, batch 9850, loss[loss=0.2022, simple_loss=0.2652, pruned_loss=0.06964, over 21396.00 frames. ], tot_loss[loss=0.2171, simple_loss=0.2935, pruned_loss=0.07036, over 4271604.77 frames. ], batch size: 194, lr: 2.96e-03, grad_scale: 16.0 2023-06-27 02:26:59,731 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1705872.0, ans=0.0 2023-06-27 02:27:02,348 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.847e+02 5.295e+02 7.367e+02 1.134e+03 2.701e+03, threshold=1.473e+03, percent-clipped=3.0 2023-06-27 02:27:06,535 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=1705872.0, ans=0.2 2023-06-27 02:27:08,266 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-27 02:27:46,529 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1705992.0, ans=0.125 2023-06-27 02:28:09,923 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1706052.0, ans=0.125 2023-06-27 02:28:26,484 INFO [train.py:996] (3/4) Epoch 10, batch 9900, loss[loss=0.2923, simple_loss=0.3491, pruned_loss=0.1178, over 21319.00 frames. ], tot_loss[loss=0.2136, simple_loss=0.2885, pruned_loss=0.06938, over 4269998.51 frames. ], batch size: 507, lr: 2.96e-03, grad_scale: 16.0 2023-06-27 02:30:11,351 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.69 vs. limit=15.0 2023-06-27 02:30:14,191 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=1706412.0, ans=0.125 2023-06-27 02:30:15,216 INFO [train.py:996] (3/4) Epoch 10, batch 9950, loss[loss=0.1849, simple_loss=0.2512, pruned_loss=0.05935, over 21498.00 frames. ], tot_loss[loss=0.2147, simple_loss=0.2894, pruned_loss=0.07002, over 4254207.78 frames. 
], batch size: 263, lr: 2.96e-03, grad_scale: 16.0 2023-06-27 02:30:28,068 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1706412.0, ans=0.125 2023-06-27 02:30:39,758 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.384e+02 6.546e+02 9.078e+02 1.320e+03 2.583e+03, threshold=1.816e+03, percent-clipped=18.0 2023-06-27 02:30:45,376 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1706472.0, ans=0.125 2023-06-27 02:30:46,873 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=1706472.0, ans=0.0 2023-06-27 02:30:48,824 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1706472.0, ans=0.125 2023-06-27 02:31:24,866 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=1706592.0, ans=0.125 2023-06-27 02:31:59,310 INFO [train.py:996] (3/4) Epoch 10, batch 10000, loss[loss=0.2203, simple_loss=0.2909, pruned_loss=0.07483, over 21299.00 frames. ], tot_loss[loss=0.2108, simple_loss=0.2844, pruned_loss=0.0686, over 4255456.69 frames. ], batch size: 176, lr: 2.96e-03, grad_scale: 32.0 2023-06-27 02:32:23,114 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.96 vs. limit=15.0 2023-06-27 02:32:35,278 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=1706772.0, ans=0.2 2023-06-27 02:33:30,595 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=1706892.0, ans=0.125 2023-06-27 02:33:30,608 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1706892.0, ans=0.125 2023-06-27 02:33:57,208 INFO [train.py:996] (3/4) Epoch 10, batch 10050, loss[loss=0.1965, simple_loss=0.2681, pruned_loss=0.06248, over 21601.00 frames. ], tot_loss[loss=0.2135, simple_loss=0.2874, pruned_loss=0.06981, over 4257183.87 frames. ], batch size: 415, lr: 2.96e-03, grad_scale: 32.0 2023-06-27 02:34:15,406 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1707072.0, ans=0.125 2023-06-27 02:34:16,292 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.082e+02 5.853e+02 8.209e+02 1.305e+03 2.955e+03, threshold=1.642e+03, percent-clipped=12.0 2023-06-27 02:34:27,773 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=1707072.0, ans=0.2 2023-06-27 02:35:24,188 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=4.43 vs. limit=15.0 2023-06-27 02:35:45,579 INFO [train.py:996] (3/4) Epoch 10, batch 10100, loss[loss=0.1652, simple_loss=0.2456, pruned_loss=0.04238, over 21432.00 frames. ], tot_loss[loss=0.2119, simple_loss=0.2866, pruned_loss=0.06857, over 4258049.98 frames. 
], batch size: 211, lr: 2.96e-03, grad_scale: 16.0 2023-06-27 02:36:37,648 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=1707432.0, ans=0.2 2023-06-27 02:37:13,694 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1707552.0, ans=0.125 2023-06-27 02:37:29,491 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1707552.0, ans=0.1 2023-06-27 02:37:33,970 INFO [train.py:996] (3/4) Epoch 10, batch 10150, loss[loss=0.2089, simple_loss=0.2808, pruned_loss=0.06848, over 21813.00 frames. ], tot_loss[loss=0.2172, simple_loss=0.2928, pruned_loss=0.07083, over 4268431.73 frames. ], batch size: 118, lr: 2.96e-03, grad_scale: 8.0 2023-06-27 02:37:35,403 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=13.73 vs. limit=15.0 2023-06-27 02:38:02,105 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.860e+02 5.691e+02 7.969e+02 1.243e+03 2.132e+03, threshold=1.594e+03, percent-clipped=9.0 2023-06-27 02:38:20,842 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.63 vs. limit=22.5 2023-06-27 02:38:39,235 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1707732.0, ans=0.1 2023-06-27 02:39:22,075 INFO [train.py:996] (3/4) Epoch 10, batch 10200, loss[loss=0.1938, simple_loss=0.286, pruned_loss=0.05076, over 21790.00 frames. ], tot_loss[loss=0.2144, simple_loss=0.291, pruned_loss=0.06884, over 4269927.67 frames. ], batch size: 333, lr: 2.96e-03, grad_scale: 8.0 2023-06-27 02:39:26,373 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1707912.0, ans=0.1 2023-06-27 02:40:33,240 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.24 vs. limit=22.5 2023-06-27 02:41:10,257 INFO [train.py:996] (3/4) Epoch 10, batch 10250, loss[loss=0.1492, simple_loss=0.2299, pruned_loss=0.03425, over 21305.00 frames. ], tot_loss[loss=0.209, simple_loss=0.2887, pruned_loss=0.06464, over 4263800.89 frames. ], batch size: 159, lr: 2.96e-03, grad_scale: 8.0 2023-06-27 02:41:34,137 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=1708272.0, ans=0.04949747468305833 2023-06-27 02:41:43,181 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-27 02:41:44,120 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.003e+02 5.121e+02 6.832e+02 1.019e+03 2.987e+03, threshold=1.366e+03, percent-clipped=4.0 2023-06-27 02:42:07,540 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1708332.0, ans=0.125 2023-06-27 02:42:11,277 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=1708332.0, ans=0.125 2023-06-27 02:43:03,435 INFO [train.py:996] (3/4) Epoch 10, batch 10300, loss[loss=0.311, simple_loss=0.3884, pruned_loss=0.1168, over 21378.00 frames. 
], tot_loss[loss=0.2109, simple_loss=0.2902, pruned_loss=0.06581, over 4262822.97 frames. ], batch size: 507, lr: 2.96e-03, grad_scale: 8.0 2023-06-27 02:44:29,326 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=9.70 vs. limit=15.0 2023-06-27 02:44:42,871 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1708752.0, ans=0.0 2023-06-27 02:45:06,404 INFO [train.py:996] (3/4) Epoch 10, batch 10350, loss[loss=0.2083, simple_loss=0.2883, pruned_loss=0.06416, over 21683.00 frames. ], tot_loss[loss=0.2116, simple_loss=0.2918, pruned_loss=0.06575, over 4266407.61 frames. ], batch size: 351, lr: 2.96e-03, grad_scale: 8.0 2023-06-27 02:45:10,337 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1708812.0, ans=0.1 2023-06-27 02:45:35,568 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.505e+02 7.876e+02 1.206e+03 1.704e+03 3.503e+03, threshold=2.411e+03, percent-clipped=40.0 2023-06-27 02:45:39,622 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=1708872.0, ans=0.125 2023-06-27 02:46:03,136 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.97 vs. limit=15.0 2023-06-27 02:46:24,233 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=1708992.0, ans=0.04949747468305833 2023-06-27 02:46:57,650 INFO [train.py:996] (3/4) Epoch 10, batch 10400, loss[loss=0.2209, simple_loss=0.3037, pruned_loss=0.06902, over 21531.00 frames. ], tot_loss[loss=0.2077, simple_loss=0.2857, pruned_loss=0.06479, over 4270705.54 frames. ], batch size: 441, lr: 2.96e-03, grad_scale: 16.0 2023-06-27 02:47:57,150 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1709232.0, ans=0.125 2023-06-27 02:47:58,856 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=1709232.0, ans=0.0 2023-06-27 02:48:22,423 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=14.09 vs. limit=22.5 2023-06-27 02:48:52,944 INFO [train.py:996] (3/4) Epoch 10, batch 10450, loss[loss=0.271, simple_loss=0.3509, pruned_loss=0.09554, over 21632.00 frames. ], tot_loss[loss=0.212, simple_loss=0.2897, pruned_loss=0.06719, over 4266398.86 frames. ], batch size: 414, lr: 2.96e-03, grad_scale: 16.0 2023-06-27 02:48:58,982 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=1709412.0, ans=0.0 2023-06-27 02:49:21,562 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.083e+02 7.279e+02 1.026e+03 1.542e+03 3.103e+03, threshold=2.052e+03, percent-clipped=9.0 2023-06-27 02:50:41,342 INFO [train.py:996] (3/4) Epoch 10, batch 10500, loss[loss=0.1817, simple_loss=0.2563, pruned_loss=0.05356, over 21745.00 frames. ], tot_loss[loss=0.2117, simple_loss=0.2906, pruned_loss=0.06638, over 4271318.30 frames. 
], batch size: 316, lr: 2.96e-03, grad_scale: 16.0 2023-06-27 02:51:16,849 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=1709772.0, ans=0.125 2023-06-27 02:51:33,994 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1709832.0, ans=0.0 2023-06-27 02:51:35,552 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=1709832.0, ans=0.0 2023-06-27 02:51:45,912 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=1709892.0, ans=0.125 2023-06-27 02:51:51,469 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.14 vs. limit=22.5 2023-06-27 02:52:28,679 INFO [train.py:996] (3/4) Epoch 10, batch 10550, loss[loss=0.2016, simple_loss=0.2697, pruned_loss=0.06673, over 21882.00 frames. ], tot_loss[loss=0.2073, simple_loss=0.2841, pruned_loss=0.06523, over 4259391.35 frames. ], batch size: 98, lr: 2.96e-03, grad_scale: 16.0 2023-06-27 02:52:55,938 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.583e+02 5.517e+02 8.817e+02 1.298e+03 2.428e+03, threshold=1.763e+03, percent-clipped=4.0 2023-06-27 02:53:03,350 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.11 vs. limit=15.0 2023-06-27 02:54:10,251 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=1710252.0, ans=0.0 2023-06-27 02:54:15,380 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=1710312.0, ans=0.125 2023-06-27 02:54:16,526 INFO [train.py:996] (3/4) Epoch 10, batch 10600, loss[loss=0.1652, simple_loss=0.2359, pruned_loss=0.04724, over 21775.00 frames. ], tot_loss[loss=0.2037, simple_loss=0.2797, pruned_loss=0.06383, over 4261533.68 frames. ], batch size: 124, lr: 2.96e-03, grad_scale: 16.0 2023-06-27 02:54:20,950 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=1710312.0, ans=0.125 2023-06-27 02:54:59,439 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1710432.0, ans=0.125 2023-06-27 02:55:17,686 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.79 vs. limit=6.0 2023-06-27 02:55:31,240 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1710492.0, ans=0.125 2023-06-27 02:55:57,494 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1710552.0, ans=0.125 2023-06-27 02:56:13,018 INFO [train.py:996] (3/4) Epoch 10, batch 10650, loss[loss=0.2275, simple_loss=0.2897, pruned_loss=0.0826, over 20061.00 frames. ], tot_loss[loss=0.204, simple_loss=0.2825, pruned_loss=0.06277, over 4259176.10 frames. 
], batch size: 702, lr: 2.96e-03, grad_scale: 16.0 2023-06-27 02:56:35,995 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.041e+02 6.303e+02 9.847e+02 1.673e+03 3.050e+03, threshold=1.969e+03, percent-clipped=22.0 2023-06-27 02:56:38,085 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=1710672.0, ans=0.2 2023-06-27 02:56:40,119 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=1710672.0, ans=0.0 2023-06-27 02:56:55,201 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=1710732.0, ans=0.0 2023-06-27 02:57:51,370 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1710852.0, ans=0.0 2023-06-27 02:58:01,501 INFO [train.py:996] (3/4) Epoch 10, batch 10700, loss[loss=0.2609, simple_loss=0.3344, pruned_loss=0.09369, over 21744.00 frames. ], tot_loss[loss=0.2029, simple_loss=0.2805, pruned_loss=0.06263, over 4248483.24 frames. ], batch size: 441, lr: 2.96e-03, grad_scale: 8.0 2023-06-27 02:58:02,213 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1710912.0, ans=0.0 2023-06-27 02:58:43,146 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.whiten.whitening_limit, batch_count=1711032.0, ans=12.0 2023-06-27 02:59:23,694 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=1711092.0, ans=0.2 2023-06-27 02:59:51,980 INFO [train.py:996] (3/4) Epoch 10, batch 10750, loss[loss=0.2223, simple_loss=0.2994, pruned_loss=0.07265, over 21319.00 frames. ], tot_loss[loss=0.2119, simple_loss=0.2904, pruned_loss=0.06667, over 4255679.24 frames. ], batch size: 176, lr: 2.96e-03, grad_scale: 8.0 2023-06-27 02:59:55,915 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1711212.0, ans=0.1 2023-06-27 03:00:21,238 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.422e+02 6.069e+02 8.010e+02 1.380e+03 3.013e+03, threshold=1.602e+03, percent-clipped=10.0 2023-06-27 03:01:20,900 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-27 03:01:33,512 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1711452.0, ans=0.125 2023-06-27 03:01:41,464 INFO [train.py:996] (3/4) Epoch 10, batch 10800, loss[loss=0.2274, simple_loss=0.2922, pruned_loss=0.08128, over 20028.00 frames. ], tot_loss[loss=0.2135, simple_loss=0.2938, pruned_loss=0.06665, over 4259161.90 frames. 
], batch size: 702, lr: 2.96e-03, grad_scale: 16.0 2023-06-27 03:01:59,968 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=1711512.0, ans=0.125 2023-06-27 03:02:21,533 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=1711572.0, ans=0.07 2023-06-27 03:02:32,198 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1711632.0, ans=0.125 2023-06-27 03:02:55,033 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1711692.0, ans=0.1 2023-06-27 03:03:23,737 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1711752.0, ans=0.1 2023-06-27 03:03:30,050 INFO [train.py:996] (3/4) Epoch 10, batch 10850, loss[loss=0.1897, simple_loss=0.2663, pruned_loss=0.05655, over 21640.00 frames. ], tot_loss[loss=0.2144, simple_loss=0.2943, pruned_loss=0.0672, over 4257680.28 frames. ], batch size: 282, lr: 2.96e-03, grad_scale: 16.0 2023-06-27 03:04:04,381 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1711872.0, ans=0.1 2023-06-27 03:04:05,452 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.119e+02 5.251e+02 7.747e+02 1.275e+03 2.663e+03, threshold=1.549e+03, percent-clipped=11.0 2023-06-27 03:04:13,729 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.59 vs. limit=15.0 2023-06-27 03:04:48,073 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1711992.0, ans=0.1 2023-06-27 03:05:15,666 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=1712052.0, ans=0.0 2023-06-27 03:05:19,242 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1712052.0, ans=0.125 2023-06-27 03:05:23,781 INFO [train.py:996] (3/4) Epoch 10, batch 10900, loss[loss=0.2312, simple_loss=0.3263, pruned_loss=0.06808, over 19788.00 frames. ], tot_loss[loss=0.2102, simple_loss=0.2894, pruned_loss=0.06554, over 4262448.47 frames. ], batch size: 702, lr: 2.96e-03, grad_scale: 8.0 2023-06-27 03:06:32,919 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=1712292.0, ans=0.05 2023-06-27 03:06:48,849 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=1712352.0, ans=0.0 2023-06-27 03:06:53,852 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=1712352.0, ans=0.125 2023-06-27 03:07:11,334 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=1712412.0, ans=0.2 2023-06-27 03:07:12,351 INFO [train.py:996] (3/4) Epoch 10, batch 10950, loss[loss=0.1888, simple_loss=0.2589, pruned_loss=0.05941, over 21827.00 frames. ], tot_loss[loss=0.2065, simple_loss=0.2853, pruned_loss=0.0638, over 4266241.71 frames. 
], batch size: 107, lr: 2.96e-03, grad_scale: 8.0 2023-06-27 03:07:29,240 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1712412.0, ans=0.0 2023-06-27 03:07:31,359 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.12 vs. limit=15.0 2023-06-27 03:07:48,598 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.904e+02 6.171e+02 9.007e+02 1.291e+03 2.424e+03, threshold=1.801e+03, percent-clipped=14.0 2023-06-27 03:08:57,633 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1712712.0, ans=0.125 2023-06-27 03:08:58,757 INFO [train.py:996] (3/4) Epoch 10, batch 11000, loss[loss=0.1947, simple_loss=0.2625, pruned_loss=0.06347, over 21607.00 frames. ], tot_loss[loss=0.2065, simple_loss=0.285, pruned_loss=0.06402, over 4262326.14 frames. ], batch size: 230, lr: 2.96e-03, grad_scale: 8.0 2023-06-27 03:09:19,070 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=1712712.0, ans=0.125 2023-06-27 03:10:46,678 INFO [train.py:996] (3/4) Epoch 10, batch 11050, loss[loss=0.175, simple_loss=0.2167, pruned_loss=0.0667, over 20094.00 frames. ], tot_loss[loss=0.2062, simple_loss=0.2817, pruned_loss=0.06532, over 4264637.47 frames. ], batch size: 704, lr: 2.96e-03, grad_scale: 8.0 2023-06-27 03:11:22,036 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.001e+02 5.814e+02 8.503e+02 1.206e+03 2.810e+03, threshold=1.701e+03, percent-clipped=7.0 2023-06-27 03:11:22,727 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=1713072.0, ans=0.09899494936611666 2023-06-27 03:11:37,345 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.27 vs. limit=10.0 2023-06-27 03:11:40,354 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.69 vs. limit=10.0 2023-06-27 03:12:00,335 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1713192.0, ans=0.125 2023-06-27 03:12:03,695 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1713192.0, ans=0.125 2023-06-27 03:12:14,350 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=1713252.0, ans=0.125 2023-06-27 03:12:33,199 INFO [train.py:996] (3/4) Epoch 10, batch 11100, loss[loss=0.2113, simple_loss=0.2768, pruned_loss=0.07289, over 21754.00 frames. ], tot_loss[loss=0.206, simple_loss=0.2814, pruned_loss=0.06533, over 4253692.76 frames. ], batch size: 112, lr: 2.96e-03, grad_scale: 8.0 2023-06-27 03:12:37,155 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1713312.0, ans=0.125 2023-06-27 03:13:46,013 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=1713492.0, ans=0.125 2023-06-27 03:14:22,306 INFO [train.py:996] (3/4) Epoch 10, batch 11150, loss[loss=0.2052, simple_loss=0.2913, pruned_loss=0.05955, over 21290.00 frames. 
], tot_loss[loss=0.2059, simple_loss=0.2806, pruned_loss=0.0656, over 4250099.46 frames. ], batch size: 176, lr: 2.96e-03, grad_scale: 8.0 2023-06-27 03:14:40,270 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1713612.0, ans=0.125 2023-06-27 03:14:43,751 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=1713672.0, ans=0.0 2023-06-27 03:14:58,533 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.769e+02 5.768e+02 8.894e+02 1.400e+03 2.503e+03, threshold=1.779e+03, percent-clipped=10.0 2023-06-27 03:15:21,823 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.79 vs. limit=15.0 2023-06-27 03:15:24,795 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=1713732.0, ans=0.125 2023-06-27 03:16:03,798 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=1713852.0, ans=0.025 2023-06-27 03:16:08,616 INFO [train.py:996] (3/4) Epoch 10, batch 11200, loss[loss=0.191, simple_loss=0.2619, pruned_loss=0.05999, over 21608.00 frames. ], tot_loss[loss=0.2045, simple_loss=0.2782, pruned_loss=0.06535, over 4249699.07 frames. ], batch size: 332, lr: 2.96e-03, grad_scale: 16.0 2023-06-27 03:17:30,731 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=1714092.0, ans=0.07 2023-06-27 03:17:36,671 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.45 vs. limit=6.0 2023-06-27 03:17:55,858 INFO [train.py:996] (3/4) Epoch 10, batch 11250, loss[loss=0.2195, simple_loss=0.2966, pruned_loss=0.07119, over 21196.00 frames. ], tot_loss[loss=0.2037, simple_loss=0.2773, pruned_loss=0.0651, over 4258249.59 frames. ], batch size: 159, lr: 2.96e-03, grad_scale: 16.0 2023-06-27 03:18:24,588 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=10.43 vs. limit=15.0 2023-06-27 03:18:25,561 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-27 03:18:26,747 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.073e+02 5.382e+02 8.145e+02 1.130e+03 2.477e+03, threshold=1.629e+03, percent-clipped=7.0 2023-06-27 03:18:27,493 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=1714272.0, ans=0.0 2023-06-27 03:18:29,919 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.02 vs. limit=10.0 2023-06-27 03:18:31,513 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=10.46 vs. limit=15.0 2023-06-27 03:19:38,924 INFO [train.py:996] (3/4) Epoch 10, batch 11300, loss[loss=0.1876, simple_loss=0.2701, pruned_loss=0.0525, over 21275.00 frames. ], tot_loss[loss=0.2058, simple_loss=0.2795, pruned_loss=0.06601, over 4262518.48 frames. 
], batch size: 159, lr: 2.96e-03, grad_scale: 16.0 2023-06-27 03:19:44,688 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=1714512.0, ans=0.2 2023-06-27 03:20:22,196 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=1714572.0, ans=0.125 2023-06-27 03:20:31,046 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1714632.0, ans=0.125 2023-06-27 03:20:32,981 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1714632.0, ans=0.1 2023-06-27 03:21:02,503 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1714692.0, ans=0.125 2023-06-27 03:21:04,437 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1714752.0, ans=0.0 2023-06-27 03:21:18,509 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=1714752.0, ans=0.025 2023-06-27 03:21:22,944 INFO [train.py:996] (3/4) Epoch 10, batch 11350, loss[loss=0.1977, simple_loss=0.2785, pruned_loss=0.05842, over 21710.00 frames. ], tot_loss[loss=0.2055, simple_loss=0.2803, pruned_loss=0.06539, over 4257047.37 frames. ], batch size: 263, lr: 2.96e-03, grad_scale: 16.0 2023-06-27 03:22:00,028 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.082e+02 5.912e+02 7.672e+02 1.183e+03 2.053e+03, threshold=1.534e+03, percent-clipped=10.0 2023-06-27 03:22:50,380 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1715052.0, ans=0.0 2023-06-27 03:22:53,652 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1715052.0, ans=0.125 2023-06-27 03:23:12,827 INFO [train.py:996] (3/4) Epoch 10, batch 11400, loss[loss=0.2136, simple_loss=0.3031, pruned_loss=0.06202, over 21837.00 frames. ], tot_loss[loss=0.2107, simple_loss=0.2857, pruned_loss=0.06781, over 4266152.15 frames. ], batch size: 317, lr: 2.95e-03, grad_scale: 16.0 2023-06-27 03:23:13,411 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1715112.0, ans=0.1 2023-06-27 03:23:16,005 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=15.18 vs. limit=15.0 2023-06-27 03:24:34,919 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=1715292.0, ans=0.0 2023-06-27 03:25:07,615 INFO [train.py:996] (3/4) Epoch 10, batch 11450, loss[loss=0.2368, simple_loss=0.3193, pruned_loss=0.07717, over 21705.00 frames. ], tot_loss[loss=0.2111, simple_loss=0.2875, pruned_loss=0.06733, over 4258906.70 frames. 
], batch size: 441, lr: 2.95e-03, grad_scale: 16.0 2023-06-27 03:25:16,333 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=1715412.0, ans=0.125 2023-06-27 03:25:33,682 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.931e+02 7.490e+02 1.068e+03 1.427e+03 2.700e+03, threshold=2.136e+03, percent-clipped=19.0 2023-06-27 03:26:03,887 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1715592.0, ans=0.1 2023-06-27 03:26:28,302 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1715652.0, ans=0.125 2023-06-27 03:26:31,893 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=1715652.0, ans=0.2 2023-06-27 03:26:50,437 INFO [train.py:996] (3/4) Epoch 10, batch 11500, loss[loss=0.1953, simple_loss=0.2932, pruned_loss=0.0487, over 21799.00 frames. ], tot_loss[loss=0.2128, simple_loss=0.2897, pruned_loss=0.06793, over 4264413.11 frames. ], batch size: 282, lr: 2.95e-03, grad_scale: 16.0 2023-06-27 03:27:03,846 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.37 vs. limit=15.0 2023-06-27 03:27:15,461 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1715772.0, ans=0.1 2023-06-27 03:27:54,254 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=1715892.0, ans=0.2 2023-06-27 03:28:12,169 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=1715892.0, ans=0.95 2023-06-27 03:28:23,091 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten.whitening_limit, batch_count=1715952.0, ans=22.5 2023-06-27 03:28:45,000 INFO [train.py:996] (3/4) Epoch 10, batch 11550, loss[loss=0.237, simple_loss=0.3323, pruned_loss=0.07085, over 21838.00 frames. ], tot_loss[loss=0.2151, simple_loss=0.2948, pruned_loss=0.06766, over 4271836.07 frames. ], batch size: 282, lr: 2.95e-03, grad_scale: 16.0 2023-06-27 03:28:47,368 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1716012.0, ans=0.125 2023-06-27 03:29:00,343 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1716012.0, ans=0.125 2023-06-27 03:29:16,269 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1716072.0, ans=0.125 2023-06-27 03:29:17,151 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.702e+02 7.297e+02 1.033e+03 1.557e+03 3.418e+03, threshold=2.066e+03, percent-clipped=10.0 2023-06-27 03:29:23,176 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1716072.0, ans=0.125 2023-06-27 03:30:03,466 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=1716192.0, ans=0.0 2023-06-27 03:30:32,989 INFO [train.py:996] (3/4) Epoch 10, batch 11600, loss[loss=0.2498, simple_loss=0.3383, pruned_loss=0.08064, over 21482.00 frames. 
], tot_loss[loss=0.2257, simple_loss=0.3113, pruned_loss=0.07003, over 4272572.14 frames. ], batch size: 194, lr: 2.95e-03, grad_scale: 32.0 2023-06-27 03:30:56,726 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1716372.0, ans=0.125 2023-06-27 03:32:20,571 INFO [train.py:996] (3/4) Epoch 10, batch 11650, loss[loss=0.2019, simple_loss=0.2729, pruned_loss=0.06547, over 15327.00 frames. ], tot_loss[loss=0.2292, simple_loss=0.3174, pruned_loss=0.07047, over 4264597.35 frames. ], batch size: 60, lr: 2.95e-03, grad_scale: 16.0 2023-06-27 03:32:52,968 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.091e+02 7.350e+02 9.956e+02 1.670e+03 3.528e+03, threshold=1.991e+03, percent-clipped=18.0 2023-06-27 03:33:09,807 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=1716732.0, ans=10.0 2023-06-27 03:33:30,778 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=1716792.0, ans=0.04949747468305833 2023-06-27 03:33:50,821 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=1716852.0, ans=0.09899494936611666 2023-06-27 03:34:07,063 INFO [train.py:996] (3/4) Epoch 10, batch 11700, loss[loss=0.194, simple_loss=0.2712, pruned_loss=0.05844, over 21764.00 frames. ], tot_loss[loss=0.2244, simple_loss=0.3091, pruned_loss=0.0698, over 4252828.31 frames. ], batch size: 112, lr: 2.95e-03, grad_scale: 16.0 2023-06-27 03:34:57,910 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1717032.0, ans=0.125 2023-06-27 03:34:59,533 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=1717032.0, ans=0.2 2023-06-27 03:35:49,709 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.40 vs. limit=15.0 2023-06-27 03:35:53,419 INFO [train.py:996] (3/4) Epoch 10, batch 11750, loss[loss=0.2245, simple_loss=0.3096, pruned_loss=0.06971, over 21843.00 frames. ], tot_loss[loss=0.2194, simple_loss=0.3001, pruned_loss=0.06928, over 4255442.61 frames. ], batch size: 124, lr: 2.95e-03, grad_scale: 16.0 2023-06-27 03:36:26,192 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.050e+02 5.774e+02 7.571e+02 1.065e+03 1.774e+03, threshold=1.514e+03, percent-clipped=0.0 2023-06-27 03:36:40,232 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.10 vs. limit=10.0 2023-06-27 03:37:41,074 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=1717512.0, ans=0.0 2023-06-27 03:37:42,105 INFO [train.py:996] (3/4) Epoch 10, batch 11800, loss[loss=0.2275, simple_loss=0.3048, pruned_loss=0.07506, over 21324.00 frames. ], tot_loss[loss=0.2215, simple_loss=0.3009, pruned_loss=0.07101, over 4265011.54 frames. 
], batch size: 159, lr: 2.95e-03, grad_scale: 16.0 2023-06-27 03:38:10,875 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1717572.0, ans=0.0 2023-06-27 03:38:30,026 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1717632.0, ans=0.1 2023-06-27 03:38:38,610 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1717632.0, ans=0.0 2023-06-27 03:39:30,358 INFO [train.py:996] (3/4) Epoch 10, batch 11850, loss[loss=0.2207, simple_loss=0.3084, pruned_loss=0.06649, over 21831.00 frames. ], tot_loss[loss=0.2214, simple_loss=0.3025, pruned_loss=0.07018, over 4270521.98 frames. ], batch size: 351, lr: 2.95e-03, grad_scale: 16.0 2023-06-27 03:39:48,693 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1717812.0, ans=0.125 2023-06-27 03:40:09,300 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.078e+02 6.779e+02 9.644e+02 1.423e+03 2.292e+03, threshold=1.929e+03, percent-clipped=21.0 2023-06-27 03:40:26,111 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=1717932.0, ans=0.2 2023-06-27 03:40:57,909 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1717992.0, ans=0.125 2023-06-27 03:41:21,728 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=1718052.0, ans=0.2 2023-06-27 03:41:25,951 INFO [train.py:996] (3/4) Epoch 10, batch 11900, loss[loss=0.1902, simple_loss=0.2798, pruned_loss=0.05031, over 21680.00 frames. ], tot_loss[loss=0.22, simple_loss=0.3047, pruned_loss=0.06762, over 4276258.63 frames. ], batch size: 247, lr: 2.95e-03, grad_scale: 16.0 2023-06-27 03:41:31,664 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=1718112.0, ans=0.125 2023-06-27 03:41:32,418 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=15.69 vs. limit=22.5 2023-06-27 03:42:05,731 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1718172.0, ans=0.1 2023-06-27 03:42:25,713 INFO [scaling.py:962] (3/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=3.99 vs. limit=5.0 2023-06-27 03:42:31,818 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1718232.0, ans=0.0 2023-06-27 03:43:15,204 INFO [train.py:996] (3/4) Epoch 10, batch 11950, loss[loss=0.1705, simple_loss=0.2689, pruned_loss=0.03602, over 21785.00 frames. ], tot_loss[loss=0.2174, simple_loss=0.3057, pruned_loss=0.06454, over 4267090.90 frames. 
], batch size: 316, lr: 2.95e-03, grad_scale: 16.0 2023-06-27 03:43:53,624 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.803e+02 5.577e+02 8.393e+02 1.338e+03 3.088e+03, threshold=1.679e+03, percent-clipped=11.0 2023-06-27 03:44:02,673 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1718472.0, ans=0.125 2023-06-27 03:44:30,771 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=1718592.0, ans=0.05 2023-06-27 03:44:31,322 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=10.04 vs. limit=15.0 2023-06-27 03:45:09,399 INFO [train.py:996] (3/4) Epoch 10, batch 12000, loss[loss=0.2237, simple_loss=0.2802, pruned_loss=0.08361, over 21574.00 frames. ], tot_loss[loss=0.2132, simple_loss=0.2991, pruned_loss=0.06366, over 4260838.40 frames. ], batch size: 414, lr: 2.95e-03, grad_scale: 32.0 2023-06-27 03:45:09,400 INFO [train.py:1019] (3/4) Computing validation loss 2023-06-27 03:45:30,592 INFO [train.py:1028] (3/4) Epoch 10, validation: loss=0.2595, simple_loss=0.3509, pruned_loss=0.08412, over 1796401.00 frames. 2023-06-27 03:45:30,593 INFO [train.py:1029] (3/4) Maximum memory allocated so far is 23690MB 2023-06-27 03:45:34,882 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=1718712.0, ans=0.2 2023-06-27 03:46:02,685 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=5.91 vs. limit=15.0 2023-06-27 03:46:10,570 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=1718832.0, ans=0.0 2023-06-27 03:46:13,792 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1718832.0, ans=0.125 2023-06-27 03:46:31,356 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1718892.0, ans=0.125 2023-06-27 03:46:34,859 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=1718892.0, ans=0.04949747468305833 2023-06-27 03:46:40,318 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=1718892.0, ans=0.0 2023-06-27 03:46:49,460 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.50 vs. limit=12.0 2023-06-27 03:47:07,266 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1718952.0, ans=0.0 2023-06-27 03:47:18,632 INFO [train.py:996] (3/4) Epoch 10, batch 12050, loss[loss=0.2196, simple_loss=0.2944, pruned_loss=0.07235, over 21501.00 frames. ], tot_loss[loss=0.2129, simple_loss=0.2948, pruned_loss=0.0655, over 4256370.05 frames. 
], batch size: 131, lr: 2.95e-03, grad_scale: 32.0 2023-06-27 03:47:41,682 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.min_positive, batch_count=1719072.0, ans=0.025 2023-06-27 03:47:53,491 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.219e+02 6.182e+02 8.249e+02 1.335e+03 3.065e+03, threshold=1.650e+03, percent-clipped=10.0 2023-06-27 03:48:15,361 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1719132.0, ans=0.0 2023-06-27 03:48:20,793 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.46 vs. limit=12.0 2023-06-27 03:48:32,703 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1719192.0, ans=0.125 2023-06-27 03:49:01,495 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=1719252.0, ans=0.125 2023-06-27 03:49:08,215 INFO [train.py:996] (3/4) Epoch 10, batch 12100, loss[loss=0.1836, simple_loss=0.2633, pruned_loss=0.05198, over 19991.00 frames. ], tot_loss[loss=0.2181, simple_loss=0.2988, pruned_loss=0.06863, over 4263688.85 frames. ], batch size: 702, lr: 2.95e-03, grad_scale: 16.0 2023-06-27 03:50:02,292 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.72 vs. limit=12.0 2023-06-27 03:50:22,244 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=1719492.0, ans=0.125 2023-06-27 03:50:43,645 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1719552.0, ans=0.125 2023-06-27 03:51:03,853 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.whiten.whitening_limit, batch_count=1719552.0, ans=12.0 2023-06-27 03:51:05,011 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=1719612.0, ans=0.0 2023-06-27 03:51:06,035 INFO [train.py:996] (3/4) Epoch 10, batch 12150, loss[loss=0.1715, simple_loss=0.29, pruned_loss=0.02646, over 19788.00 frames. ], tot_loss[loss=0.2178, simple_loss=0.3, pruned_loss=0.06782, over 4262939.74 frames. ], batch size: 703, lr: 2.95e-03, grad_scale: 16.0 2023-06-27 03:51:10,512 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=2.78 vs. 
limit=6.0 2023-06-27 03:51:31,416 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=1719672.0, ans=0.2 2023-06-27 03:51:41,000 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.275e+02 6.507e+02 9.290e+02 1.300e+03 3.036e+03, threshold=1.858e+03, percent-clipped=15.0 2023-06-27 03:51:45,077 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-27 03:52:02,314 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1719732.0, ans=0.125 2023-06-27 03:52:05,922 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1719732.0, ans=0.1 2023-06-27 03:52:10,227 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.11 vs. limit=6.0 2023-06-27 03:52:53,534 INFO [train.py:996] (3/4) Epoch 10, batch 12200, loss[loss=0.1881, simple_loss=0.2449, pruned_loss=0.0657, over 21201.00 frames. ], tot_loss[loss=0.2155, simple_loss=0.2973, pruned_loss=0.0669, over 4258839.98 frames. ], batch size: 548, lr: 2.95e-03, grad_scale: 16.0 2023-06-27 03:52:56,671 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=9.41 vs. limit=15.0 2023-06-27 03:53:20,382 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=1719972.0, ans=0.125 2023-06-27 03:53:32,103 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1720032.0, ans=0.125 2023-06-27 03:53:59,875 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=2.99 vs. limit=6.0 2023-06-27 03:54:14,024 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.78 vs. limit=15.0 2023-06-27 03:54:15,878 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.77 vs. limit=10.0 2023-06-27 03:54:29,426 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=1720152.0, ans=0.09899494936611666 2023-06-27 03:54:32,453 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1720152.0, ans=0.125 2023-06-27 03:54:40,551 INFO [train.py:996] (3/4) Epoch 10, batch 12250, loss[loss=0.1629, simple_loss=0.2427, pruned_loss=0.04152, over 21629.00 frames. ], tot_loss[loss=0.2078, simple_loss=0.288, pruned_loss=0.06376, over 4261346.91 frames. ], batch size: 230, lr: 2.95e-03, grad_scale: 16.0 2023-06-27 03:55:13,991 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.13 vs. 
limit=6.0 2023-06-27 03:55:14,845 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.738e+02 5.320e+02 7.726e+02 1.159e+03 2.410e+03, threshold=1.545e+03, percent-clipped=8.0 2023-06-27 03:56:22,411 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.14 vs. limit=6.0 2023-06-27 03:56:28,166 INFO [train.py:996] (3/4) Epoch 10, batch 12300, loss[loss=0.1581, simple_loss=0.2367, pruned_loss=0.03971, over 21170.00 frames. ], tot_loss[loss=0.2012, simple_loss=0.2819, pruned_loss=0.06027, over 4264745.08 frames. ], batch size: 143, lr: 2.95e-03, grad_scale: 16.0 2023-06-27 03:56:49,703 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1720572.0, ans=0.125 2023-06-27 03:57:06,865 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1720632.0, ans=0.125 2023-06-27 03:57:20,048 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.40 vs. limit=15.0 2023-06-27 03:58:16,034 INFO [train.py:996] (3/4) Epoch 10, batch 12350, loss[loss=0.2216, simple_loss=0.315, pruned_loss=0.06403, over 21856.00 frames. ], tot_loss[loss=0.2039, simple_loss=0.2859, pruned_loss=0.06096, over 4272032.68 frames. ], batch size: 316, lr: 2.95e-03, grad_scale: 16.0 2023-06-27 03:58:50,881 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.592e+02 6.371e+02 1.042e+03 1.645e+03 3.511e+03, threshold=2.083e+03, percent-clipped=28.0 2023-06-27 03:59:27,534 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=1720992.0, ans=0.0 2023-06-27 03:59:53,243 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1721052.0, ans=0.0 2023-06-27 04:00:04,491 INFO [train.py:996] (3/4) Epoch 10, batch 12400, loss[loss=0.2099, simple_loss=0.3299, pruned_loss=0.04499, over 20786.00 frames. ], tot_loss[loss=0.2073, simple_loss=0.2884, pruned_loss=0.06311, over 4273122.08 frames. ], batch size: 608, lr: 2.95e-03, grad_scale: 32.0 2023-06-27 04:00:45,678 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer_na.min_abs, batch_count=1721172.0, ans=0.02 2023-06-27 04:01:06,518 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1721232.0, ans=0.125 2023-06-27 04:01:08,483 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1721232.0, ans=0.125 2023-06-27 04:01:17,076 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1721292.0, ans=0.125 2023-06-27 04:01:43,273 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.max_positive, batch_count=1721352.0, ans=0.95 2023-06-27 04:01:58,686 INFO [train.py:996] (3/4) Epoch 10, batch 12450, loss[loss=0.2411, simple_loss=0.3145, pruned_loss=0.08386, over 21347.00 frames. ], tot_loss[loss=0.2118, simple_loss=0.2915, pruned_loss=0.06601, over 4272125.22 frames. 
], batch size: 176, lr: 2.95e-03, grad_scale: 16.0 2023-06-27 04:02:26,075 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=1721472.0, ans=0.125 2023-06-27 04:02:36,086 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.371e+02 6.019e+02 7.668e+02 9.401e+02 2.639e+03, threshold=1.534e+03, percent-clipped=2.0 2023-06-27 04:03:24,291 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=1721652.0, ans=0.2 2023-06-27 04:03:48,672 INFO [train.py:996] (3/4) Epoch 10, batch 12500, loss[loss=0.2373, simple_loss=0.3377, pruned_loss=0.06848, over 21604.00 frames. ], tot_loss[loss=0.2196, simple_loss=0.3011, pruned_loss=0.06909, over 4276998.45 frames. ], batch size: 230, lr: 2.95e-03, grad_scale: 16.0 2023-06-27 04:05:16,454 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1721892.0, ans=0.125 2023-06-27 04:05:37,678 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.55 vs. limit=15.0 2023-06-27 04:05:45,533 INFO [train.py:996] (3/4) Epoch 10, batch 12550, loss[loss=0.1795, simple_loss=0.2452, pruned_loss=0.0569, over 19904.00 frames. ], tot_loss[loss=0.2251, simple_loss=0.3071, pruned_loss=0.07158, over 4276656.32 frames. ], batch size: 702, lr: 2.95e-03, grad_scale: 16.0 2023-06-27 04:06:03,778 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=1722012.0, ans=0.2 2023-06-27 04:06:18,158 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.87 vs. limit=22.5 2023-06-27 04:06:26,401 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1722072.0, ans=0.125 2023-06-27 04:06:27,276 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.271e+02 6.681e+02 8.893e+02 1.594e+03 3.232e+03, threshold=1.779e+03, percent-clipped=27.0 2023-06-27 04:07:14,674 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1722192.0, ans=0.0 2023-06-27 04:07:39,577 INFO [train.py:996] (3/4) Epoch 10, batch 12600, loss[loss=0.2453, simple_loss=0.3404, pruned_loss=0.07512, over 21601.00 frames. ], tot_loss[loss=0.2233, simple_loss=0.306, pruned_loss=0.07027, over 4278630.39 frames. ], batch size: 441, lr: 2.95e-03, grad_scale: 16.0 2023-06-27 04:08:06,443 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.33 vs. 
limit=15.0 2023-06-27 04:08:17,888 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=1722432.0, ans=0.0 2023-06-27 04:08:22,790 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-27 04:08:26,024 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1722432.0, ans=0.125 2023-06-27 04:08:31,259 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=1722432.0, ans=0.0 2023-06-27 04:08:47,138 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.23 vs. limit=12.0 2023-06-27 04:08:57,499 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=12.91 vs. limit=22.5 2023-06-27 04:09:20,820 INFO [train.py:996] (3/4) Epoch 10, batch 12650, loss[loss=0.2057, simple_loss=0.2805, pruned_loss=0.06547, over 21911.00 frames. ], tot_loss[loss=0.216, simple_loss=0.2992, pruned_loss=0.06641, over 4286129.76 frames. ], batch size: 316, lr: 2.95e-03, grad_scale: 16.0 2023-06-27 04:09:33,699 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=1722612.0, ans=0.07 2023-06-27 04:10:02,118 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.614e+02 6.359e+02 1.024e+03 1.411e+03 2.503e+03, threshold=2.048e+03, percent-clipped=9.0 2023-06-27 04:11:14,819 INFO [train.py:996] (3/4) Epoch 10, batch 12700, loss[loss=0.2591, simple_loss=0.3257, pruned_loss=0.09628, over 21460.00 frames. ], tot_loss[loss=0.2185, simple_loss=0.299, pruned_loss=0.06897, over 4288576.94 frames. ], batch size: 471, lr: 2.95e-03, grad_scale: 16.0 2023-06-27 04:11:47,009 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1722972.0, ans=0.125 2023-06-27 04:12:18,053 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=1723092.0, ans=0.2 2023-06-27 04:12:18,105 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1723092.0, ans=0.0 2023-06-27 04:12:26,939 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.37 vs. limit=12.0 2023-06-27 04:13:08,218 INFO [train.py:996] (3/4) Epoch 10, batch 12750, loss[loss=0.1972, simple_loss=0.2515, pruned_loss=0.07144, over 20144.00 frames. ], tot_loss[loss=0.2182, simple_loss=0.2993, pruned_loss=0.06853, over 4284965.95 frames. ], batch size: 703, lr: 2.95e-03, grad_scale: 16.0 2023-06-27 04:13:10,487 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1723212.0, ans=0.125 2023-06-27 04:13:28,128 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.92 vs. 
limit=15.0 2023-06-27 04:13:38,770 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.282e+02 6.128e+02 7.827e+02 1.074e+03 2.616e+03, threshold=1.565e+03, percent-clipped=3.0 2023-06-27 04:13:39,605 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=1723272.0, ans=0.07 2023-06-27 04:13:43,044 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1723332.0, ans=0.0 2023-06-27 04:13:44,851 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1723332.0, ans=0.125 2023-06-27 04:14:01,872 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.76 vs. limit=15.0 2023-06-27 04:14:21,403 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=1723392.0, ans=0.2 2023-06-27 04:14:50,713 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1723452.0, ans=0.125 2023-06-27 04:14:55,441 INFO [train.py:996] (3/4) Epoch 10, batch 12800, loss[loss=0.2789, simple_loss=0.3309, pruned_loss=0.1135, over 21587.00 frames. ], tot_loss[loss=0.2182, simple_loss=0.2989, pruned_loss=0.0688, over 4286338.43 frames. ], batch size: 508, lr: 2.95e-03, grad_scale: 32.0 2023-06-27 04:14:58,802 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.48 vs. limit=22.5 2023-06-27 04:16:06,555 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-27 04:16:45,029 INFO [train.py:996] (3/4) Epoch 10, batch 12850, loss[loss=0.1928, simple_loss=0.2936, pruned_loss=0.04598, over 21756.00 frames. ], tot_loss[loss=0.2207, simple_loss=0.3013, pruned_loss=0.07002, over 4282497.01 frames. ], batch size: 351, lr: 2.95e-03, grad_scale: 32.0 2023-06-27 04:16:48,454 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.09 vs. limit=6.0 2023-06-27 04:17:22,021 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.972e+02 5.917e+02 7.824e+02 1.083e+03 2.191e+03, threshold=1.565e+03, percent-clipped=6.0 2023-06-27 04:17:53,297 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1723992.0, ans=0.1 2023-06-27 04:17:56,742 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=1723992.0, ans=0.0 2023-06-27 04:18:19,394 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1724052.0, ans=0.125 2023-06-27 04:18:34,526 INFO [train.py:996] (3/4) Epoch 10, batch 12900, loss[loss=0.1929, simple_loss=0.2848, pruned_loss=0.05047, over 21725.00 frames. ], tot_loss[loss=0.2162, simple_loss=0.2988, pruned_loss=0.06684, over 4280279.02 frames. ], batch size: 298, lr: 2.95e-03, grad_scale: 32.0 2023-06-27 04:19:00,357 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=9.71 vs. 
limit=22.5 2023-06-27 04:19:34,168 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=1724232.0, ans=0.0 2023-06-27 04:19:34,183 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=1724232.0, ans=0.125 2023-06-27 04:20:03,065 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1724292.0, ans=0.125 2023-06-27 04:20:23,520 INFO [train.py:996] (3/4) Epoch 10, batch 12950, loss[loss=0.221, simple_loss=0.302, pruned_loss=0.07, over 21578.00 frames. ], tot_loss[loss=0.214, simple_loss=0.297, pruned_loss=0.06554, over 4279742.52 frames. ], batch size: 389, lr: 2.95e-03, grad_scale: 32.0 2023-06-27 04:20:30,859 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=1724412.0, ans=0.0 2023-06-27 04:20:42,073 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1724412.0, ans=0.125 2023-06-27 04:21:19,205 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.055e+02 6.814e+02 9.301e+02 1.537e+03 3.645e+03, threshold=1.860e+03, percent-clipped=23.0 2023-06-27 04:21:49,769 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.78 vs. limit=10.0 2023-06-27 04:22:02,663 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=1724652.0, ans=0.2 2023-06-27 04:22:09,853 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1724652.0, ans=0.0 2023-06-27 04:22:10,281 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.60 vs. limit=10.0 2023-06-27 04:22:17,940 INFO [train.py:996] (3/4) Epoch 10, batch 13000, loss[loss=0.1406, simple_loss=0.2094, pruned_loss=0.0359, over 21842.00 frames. ], tot_loss[loss=0.2132, simple_loss=0.2957, pruned_loss=0.06538, over 4285263.17 frames. ], batch size: 102, lr: 2.95e-03, grad_scale: 16.0 2023-06-27 04:24:05,860 INFO [train.py:996] (3/4) Epoch 10, batch 13050, loss[loss=0.21, simple_loss=0.2729, pruned_loss=0.07355, over 21589.00 frames. ], tot_loss[loss=0.2106, simple_loss=0.2924, pruned_loss=0.06442, over 4281557.16 frames. ], batch size: 548, lr: 2.95e-03, grad_scale: 16.0 2023-06-27 04:24:29,159 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.00 vs. limit=15.0 2023-06-27 04:24:49,086 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.711e+02 5.473e+02 7.954e+02 1.041e+03 2.275e+03, threshold=1.591e+03, percent-clipped=1.0 2023-06-27 04:25:29,419 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=1725252.0, ans=0.0 2023-06-27 04:25:53,804 INFO [train.py:996] (3/4) Epoch 10, batch 13100, loss[loss=0.2023, simple_loss=0.282, pruned_loss=0.06127, over 21452.00 frames. ], tot_loss[loss=0.2122, simple_loss=0.295, pruned_loss=0.06475, over 4284071.28 frames. 
], batch size: 211, lr: 2.95e-03, grad_scale: 16.0 2023-06-27 04:26:38,445 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=6.84 vs. limit=15.0 2023-06-27 04:26:53,918 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1725432.0, ans=0.125 2023-06-27 04:27:30,797 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-27 04:27:43,054 INFO [train.py:996] (3/4) Epoch 10, batch 13150, loss[loss=0.2253, simple_loss=0.2989, pruned_loss=0.0759, over 21431.00 frames. ], tot_loss[loss=0.2164, simple_loss=0.2987, pruned_loss=0.06707, over 4283405.13 frames. ], batch size: 131, lr: 2.95e-03, grad_scale: 16.0 2023-06-27 04:27:52,666 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1725612.0, ans=0.125 2023-06-27 04:28:01,008 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=15.13 vs. limit=22.5 2023-06-27 04:28:01,944 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1725612.0, ans=0.0 2023-06-27 04:28:15,763 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1725672.0, ans=0.0 2023-06-27 04:28:24,530 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1725672.0, ans=0.0 2023-06-27 04:28:29,194 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-27 04:28:30,857 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1725672.0, ans=0.125 2023-06-27 04:28:32,054 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.070e+02 6.134e+02 8.116e+02 1.164e+03 2.711e+03, threshold=1.623e+03, percent-clipped=9.0 2023-06-27 04:28:32,771 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1725732.0, ans=0.125 2023-06-27 04:29:04,570 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.12 vs. limit=6.0 2023-06-27 04:29:21,516 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1725852.0, ans=0.125 2023-06-27 04:29:37,417 INFO [train.py:996] (3/4) Epoch 10, batch 13200, loss[loss=0.2132, simple_loss=0.2906, pruned_loss=0.06794, over 21503.00 frames. ], tot_loss[loss=0.2162, simple_loss=0.2985, pruned_loss=0.06693, over 4281464.76 frames. ], batch size: 194, lr: 2.95e-03, grad_scale: 32.0 2023-06-27 04:29:39,896 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1725912.0, ans=0.0 2023-06-27 04:29:43,294 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1725912.0, ans=0.125 2023-06-27 04:29:57,327 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=8.68 vs. 
limit=15.0 2023-06-27 04:30:05,699 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1725972.0, ans=0.1 2023-06-27 04:30:55,122 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-27 04:30:58,441 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=1726152.0, ans=0.2 2023-06-27 04:31:04,025 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1726152.0, ans=0.125 2023-06-27 04:31:26,741 INFO [train.py:996] (3/4) Epoch 10, batch 13250, loss[loss=0.2058, simple_loss=0.2709, pruned_loss=0.07038, over 21788.00 frames. ], tot_loss[loss=0.2176, simple_loss=0.2974, pruned_loss=0.06891, over 4279186.43 frames. ], batch size: 102, lr: 2.95e-03, grad_scale: 16.0 2023-06-27 04:31:27,406 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=1726212.0, ans=0.95 2023-06-27 04:31:48,064 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=1726272.0, ans=0.125 2023-06-27 04:32:06,244 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.999e+02 7.655e+02 1.062e+03 1.668e+03 3.650e+03, threshold=2.123e+03, percent-clipped=27.0 2023-06-27 04:32:07,147 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=1726332.0, ans=0.2 2023-06-27 04:32:22,734 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=1726332.0, ans=0.2 2023-06-27 04:32:22,834 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1726332.0, ans=0.125 2023-06-27 04:32:38,946 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=1726392.0, ans=0.0 2023-06-27 04:33:21,183 INFO [train.py:996] (3/4) Epoch 10, batch 13300, loss[loss=0.2681, simple_loss=0.3454, pruned_loss=0.09544, over 21428.00 frames. ], tot_loss[loss=0.2196, simple_loss=0.3001, pruned_loss=0.06957, over 4284073.50 frames. ], batch size: 471, lr: 2.95e-03, grad_scale: 16.0 2023-06-27 04:33:27,132 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1726512.0, ans=0.1 2023-06-27 04:33:28,721 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1726512.0, ans=0.125 2023-06-27 04:33:46,131 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1726572.0, ans=0.125 2023-06-27 04:34:20,577 INFO [scaling.py:962] (3/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=3.84 vs. limit=5.0 2023-06-27 04:34:37,481 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=1726692.0, ans=0.035 2023-06-27 04:34:57,830 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.63 vs. 
limit=22.5 2023-06-27 04:35:10,293 INFO [train.py:996] (3/4) Epoch 10, batch 13350, loss[loss=0.2223, simple_loss=0.2995, pruned_loss=0.07251, over 21845.00 frames. ], tot_loss[loss=0.224, simple_loss=0.304, pruned_loss=0.07207, over 4285187.37 frames. ], batch size: 282, lr: 2.94e-03, grad_scale: 16.0 2023-06-27 04:35:29,573 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1726872.0, ans=0.125 2023-06-27 04:35:48,972 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.150e+02 5.865e+02 7.490e+02 1.135e+03 2.182e+03, threshold=1.498e+03, percent-clipped=1.0 2023-06-27 04:36:44,956 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=1727052.0, ans=0.2 2023-06-27 04:36:50,548 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=1727052.0, ans=0.0 2023-06-27 04:36:58,398 INFO [train.py:996] (3/4) Epoch 10, batch 13400, loss[loss=0.2455, simple_loss=0.3131, pruned_loss=0.08897, over 21769.00 frames. ], tot_loss[loss=0.2262, simple_loss=0.3056, pruned_loss=0.07343, over 4287776.03 frames. ], batch size: 351, lr: 2.94e-03, grad_scale: 16.0 2023-06-27 04:38:01,397 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=1727232.0, ans=10.0 2023-06-27 04:38:15,154 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=1727292.0, ans=0.2 2023-06-27 04:38:47,927 INFO [train.py:996] (3/4) Epoch 10, batch 13450, loss[loss=0.2123, simple_loss=0.282, pruned_loss=0.07127, over 21787.00 frames. ], tot_loss[loss=0.2279, simple_loss=0.3069, pruned_loss=0.07445, over 4284396.95 frames. ], batch size: 124, lr: 2.94e-03, grad_scale: 16.0 2023-06-27 04:38:53,855 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1727412.0, ans=0.0 2023-06-27 04:39:02,417 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1727412.0, ans=0.1 2023-06-27 04:39:21,950 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=1727472.0, ans=0.05 2023-06-27 04:39:25,822 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=1727472.0, ans=0.04949747468305833 2023-06-27 04:39:31,657 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.36 vs. 
limit=10.0 2023-06-27 04:39:39,484 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.261e+02 5.946e+02 7.827e+02 1.298e+03 2.826e+03, threshold=1.565e+03, percent-clipped=16.0 2023-06-27 04:40:15,870 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2.whitening_limit, batch_count=1727592.0, ans=15.0 2023-06-27 04:40:40,746 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=1727652.0, ans=0.2 2023-06-27 04:40:42,563 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1727712.0, ans=0.0 2023-06-27 04:40:42,637 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1727712.0, ans=0.1 2023-06-27 04:40:43,701 INFO [train.py:996] (3/4) Epoch 10, batch 13500, loss[loss=0.207, simple_loss=0.2795, pruned_loss=0.06722, over 21399.00 frames. ], tot_loss[loss=0.2202, simple_loss=0.2974, pruned_loss=0.07152, over 4271569.47 frames. ], batch size: 131, lr: 2.94e-03, grad_scale: 16.0 2023-06-27 04:41:04,692 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.05 vs. limit=15.0 2023-06-27 04:41:42,284 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1727832.0, ans=0.125 2023-06-27 04:42:03,217 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=1727892.0, ans=0.04949747468305833 2023-06-27 04:42:06,708 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1727892.0, ans=0.125 2023-06-27 04:42:35,508 INFO [train.py:996] (3/4) Epoch 10, batch 13550, loss[loss=0.2404, simple_loss=0.331, pruned_loss=0.07487, over 21413.00 frames. ], tot_loss[loss=0.2205, simple_loss=0.2998, pruned_loss=0.07056, over 4275242.84 frames. ], batch size: 194, lr: 2.94e-03, grad_scale: 16.0 2023-06-27 04:43:12,381 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=1728072.0, ans=0.125 2023-06-27 04:43:16,403 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.99 vs. limit=15.0 2023-06-27 04:43:25,544 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.534e+02 7.345e+02 1.395e+03 2.191e+03 3.934e+03, threshold=2.790e+03, percent-clipped=45.0 2023-06-27 04:43:46,627 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=1728192.0, ans=0.125 2023-06-27 04:44:00,425 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=1728252.0, ans=0.125 2023-06-27 04:44:21,767 INFO [train.py:996] (3/4) Epoch 10, batch 13600, loss[loss=0.1914, simple_loss=0.2637, pruned_loss=0.05959, over 21130.00 frames. ], tot_loss[loss=0.2216, simple_loss=0.3006, pruned_loss=0.07127, over 4280201.08 frames. ], batch size: 608, lr: 2.94e-03, grad_scale: 32.0 2023-06-27 04:45:18,318 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=6.63 vs. 
limit=22.5 2023-06-27 04:45:29,737 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=1728492.0, ans=0.0 2023-06-27 04:45:45,373 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1728492.0, ans=0.125 2023-06-27 04:45:48,415 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=1728492.0, ans=0.125 2023-06-27 04:46:00,729 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=1728552.0, ans=0.0 2023-06-27 04:46:03,747 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=1728552.0, ans=0.0 2023-06-27 04:46:13,910 INFO [train.py:996] (3/4) Epoch 10, batch 13650, loss[loss=0.2061, simple_loss=0.2692, pruned_loss=0.07148, over 21286.00 frames. ], tot_loss[loss=0.2166, simple_loss=0.296, pruned_loss=0.06857, over 4276633.09 frames. ], batch size: 471, lr: 2.94e-03, grad_scale: 16.0 2023-06-27 04:46:16,492 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=1728612.0, ans=0.025 2023-06-27 04:46:56,145 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.60 vs. limit=15.0 2023-06-27 04:46:59,666 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.45 vs. limit=12.0 2023-06-27 04:46:59,888 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.764e+02 5.044e+02 6.157e+02 8.736e+02 2.830e+03, threshold=1.231e+03, percent-clipped=2.0 2023-06-27 04:47:54,104 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1728852.0, ans=0.125 2023-06-27 04:48:02,144 INFO [train.py:996] (3/4) Epoch 10, batch 13700, loss[loss=0.1731, simple_loss=0.2321, pruned_loss=0.05708, over 21819.00 frames. ], tot_loss[loss=0.2143, simple_loss=0.2928, pruned_loss=0.06794, over 4276258.48 frames. ], batch size: 124, lr: 2.94e-03, grad_scale: 16.0 2023-06-27 04:48:39,701 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1728972.0, ans=0.0 2023-06-27 04:48:57,634 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.28 vs. limit=12.0 2023-06-27 04:49:50,678 INFO [train.py:996] (3/4) Epoch 10, batch 13750, loss[loss=0.2179, simple_loss=0.3066, pruned_loss=0.06466, over 21626.00 frames. ], tot_loss[loss=0.2145, simple_loss=0.2935, pruned_loss=0.06779, over 4277347.22 frames. ], batch size: 389, lr: 2.94e-03, grad_scale: 16.0 2023-06-27 04:50:28,509 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-27 04:50:44,254 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.944e+02 7.619e+02 1.226e+03 1.767e+03 3.252e+03, threshold=2.451e+03, percent-clipped=47.0 2023-06-27 04:51:08,595 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=1729392.0, ans=0.0 2023-06-27 04:51:52,081 INFO [train.py:996] (3/4) Epoch 10, batch 13800, loss[loss=0.2442, simple_loss=0.3496, pruned_loss=0.06943, over 21847.00 frames. 
], tot_loss[loss=0.2144, simple_loss=0.2966, pruned_loss=0.06606, over 4264461.81 frames. ], batch size: 371, lr: 2.94e-03, grad_scale: 16.0 2023-06-27 04:52:08,299 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1729572.0, ans=0.1 2023-06-27 04:52:14,534 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.56 vs. limit=12.0 2023-06-27 04:52:23,914 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1729572.0, ans=0.1 2023-06-27 04:53:31,265 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.30 vs. limit=10.0 2023-06-27 04:53:40,108 INFO [train.py:996] (3/4) Epoch 10, batch 13850, loss[loss=0.2267, simple_loss=0.311, pruned_loss=0.07118, over 21382.00 frames. ], tot_loss[loss=0.2164, simple_loss=0.2998, pruned_loss=0.06649, over 4264491.21 frames. ], batch size: 131, lr: 2.94e-03, grad_scale: 8.0 2023-06-27 04:53:56,205 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=1729812.0, ans=0.07 2023-06-27 04:54:01,877 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=5.85 vs. limit=15.0 2023-06-27 04:54:23,605 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.256e+02 7.886e+02 1.223e+03 1.813e+03 4.044e+03, threshold=2.445e+03, percent-clipped=7.0 2023-06-27 04:54:54,507 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=1729992.0, ans=0.5 2023-06-27 04:55:06,784 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.51 vs. limit=10.0 2023-06-27 04:55:28,111 INFO [train.py:996] (3/4) Epoch 10, batch 13900, loss[loss=0.2117, simple_loss=0.2877, pruned_loss=0.06787, over 21824.00 frames. ], tot_loss[loss=0.222, simple_loss=0.3039, pruned_loss=0.07, over 4272413.04 frames. ], batch size: 282, lr: 2.94e-03, grad_scale: 8.0 2023-06-27 04:55:32,676 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.44 vs. limit=12.0 2023-06-27 04:55:44,583 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.28 vs. limit=22.5 2023-06-27 04:55:52,449 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=1730172.0, ans=0.0 2023-06-27 04:56:04,444 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1730172.0, ans=0.125 2023-06-27 04:57:06,086 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1730352.0, ans=0.0 2023-06-27 04:57:14,334 INFO [train.py:996] (3/4) Epoch 10, batch 13950, loss[loss=0.2143, simple_loss=0.2892, pruned_loss=0.06973, over 21636.00 frames. ], tot_loss[loss=0.2235, simple_loss=0.3037, pruned_loss=0.07165, over 4280393.35 frames. 
], batch size: 263, lr: 2.94e-03, grad_scale: 8.0 2023-06-27 04:57:32,447 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1730412.0, ans=0.125 2023-06-27 04:57:34,109 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.min_positive, batch_count=1730412.0, ans=0.025 2023-06-27 04:57:35,768 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-27 04:58:02,050 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.602e+02 6.616e+02 8.570e+02 1.217e+03 2.156e+03, threshold=1.714e+03, percent-clipped=0.0 2023-06-27 04:58:24,744 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1730592.0, ans=0.125 2023-06-27 04:58:59,371 INFO [train.py:996] (3/4) Epoch 10, batch 14000, loss[loss=0.1834, simple_loss=0.2704, pruned_loss=0.0482, over 21436.00 frames. ], tot_loss[loss=0.2199, simple_loss=0.3009, pruned_loss=0.06949, over 4275513.21 frames. ], batch size: 131, lr: 2.94e-03, grad_scale: 16.0 2023-06-27 04:59:30,290 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=9.13 vs. limit=15.0 2023-06-27 04:59:45,306 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=1730832.0, ans=0.2 2023-06-27 05:00:07,195 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=1730892.0, ans=0.0 2023-06-27 05:00:51,600 INFO [train.py:996] (3/4) Epoch 10, batch 14050, loss[loss=0.1929, simple_loss=0.2896, pruned_loss=0.04816, over 21680.00 frames. ], tot_loss[loss=0.2147, simple_loss=0.2969, pruned_loss=0.06631, over 4261726.02 frames. ], batch size: 414, lr: 2.94e-03, grad_scale: 16.0 2023-06-27 05:01:30,948 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=1731132.0, ans=0.125 2023-06-27 05:01:33,549 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.649e+02 7.272e+02 1.104e+03 1.609e+03 3.327e+03, threshold=2.207e+03, percent-clipped=18.0 2023-06-27 05:01:38,088 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.16 vs. limit=15.0 2023-06-27 05:01:46,539 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.70 vs. limit=15.0 2023-06-27 05:01:55,677 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.85 vs. limit=15.0 2023-06-27 05:02:02,579 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.07 vs. limit=22.5 2023-06-27 05:02:14,632 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1731252.0, ans=0.1 2023-06-27 05:02:17,688 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=1731252.0, ans=0.0 2023-06-27 05:02:27,180 INFO [train.py:996] (3/4) Epoch 10, batch 14100, loss[loss=0.2316, simple_loss=0.3074, pruned_loss=0.07791, over 21717.00 frames. 
], tot_loss[loss=0.2119, simple_loss=0.2911, pruned_loss=0.06635, over 4262737.88 frames. ], batch size: 351, lr: 2.94e-03, grad_scale: 16.0 2023-06-27 05:03:09,788 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.60 vs. limit=15.0 2023-06-27 05:03:12,213 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1731432.0, ans=0.0 2023-06-27 05:03:52,139 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=1731492.0, ans=0.0 2023-06-27 05:04:00,403 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=1731552.0, ans=0.5 2023-06-27 05:04:06,150 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=5.47 vs. limit=10.0 2023-06-27 05:04:12,969 INFO [train.py:996] (3/4) Epoch 10, batch 14150, loss[loss=0.2271, simple_loss=0.3092, pruned_loss=0.07249, over 21368.00 frames. ], tot_loss[loss=0.2151, simple_loss=0.2952, pruned_loss=0.06752, over 4258093.69 frames. ], batch size: 160, lr: 2.94e-03, grad_scale: 16.0 2023-06-27 05:04:59,065 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.088e+02 7.057e+02 1.107e+03 1.740e+03 3.584e+03, threshold=2.215e+03, percent-clipped=8.0 2023-06-27 05:05:55,685 INFO [train.py:996] (3/4) Epoch 10, batch 14200, loss[loss=0.2026, simple_loss=0.2791, pruned_loss=0.06311, over 21306.00 frames. ], tot_loss[loss=0.213, simple_loss=0.2933, pruned_loss=0.06636, over 4269835.95 frames. ], batch size: 176, lr: 2.94e-03, grad_scale: 16.0 2023-06-27 05:06:36,556 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1732032.0, ans=0.0 2023-06-27 05:06:45,184 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.min_positive, batch_count=1732032.0, ans=0.05 2023-06-27 05:07:21,259 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1732152.0, ans=0.1 2023-06-27 05:07:41,159 INFO [train.py:996] (3/4) Epoch 10, batch 14250, loss[loss=0.2036, simple_loss=0.2649, pruned_loss=0.07115, over 21372.00 frames. ], tot_loss[loss=0.2104, simple_loss=0.2881, pruned_loss=0.06639, over 4266102.05 frames. ], batch size: 144, lr: 2.94e-03, grad_scale: 16.0 2023-06-27 05:07:43,696 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1732212.0, ans=0.1 2023-06-27 05:07:56,238 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1732212.0, ans=0.0 2023-06-27 05:07:58,077 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=1732272.0, ans=0.0 2023-06-27 05:08:16,130 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.00 vs. 
limit=12.0 2023-06-27 05:08:32,536 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.770e+02 5.743e+02 8.448e+02 1.114e+03 2.445e+03, threshold=1.690e+03, percent-clipped=1.0 2023-06-27 05:09:25,870 INFO [train.py:996] (3/4) Epoch 10, batch 14300, loss[loss=0.3421, simple_loss=0.4348, pruned_loss=0.1246, over 21529.00 frames. ], tot_loss[loss=0.2127, simple_loss=0.2918, pruned_loss=0.06679, over 4247624.40 frames. ], batch size: 471, lr: 2.94e-03, grad_scale: 8.0 2023-06-27 05:10:12,006 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1732632.0, ans=0.1 2023-06-27 05:10:19,283 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1732632.0, ans=0.125 2023-06-27 05:10:26,402 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.40 vs. limit=22.5 2023-06-27 05:10:56,177 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1732752.0, ans=0.125 2023-06-27 05:11:04,238 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1732752.0, ans=0.1 2023-06-27 05:11:11,495 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1732752.0, ans=0.125 2023-06-27 05:11:14,207 INFO [train.py:996] (3/4) Epoch 10, batch 14350, loss[loss=0.2299, simple_loss=0.3046, pruned_loss=0.07759, over 21747.00 frames. ], tot_loss[loss=0.2147, simple_loss=0.2958, pruned_loss=0.06682, over 4232325.82 frames. ], batch size: 389, lr: 2.94e-03, grad_scale: 8.0 2023-06-27 05:11:20,916 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=8.90 vs. limit=22.5 2023-06-27 05:11:49,208 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=1732872.0, ans=0.125 2023-06-27 05:12:04,604 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.018e+02 7.754e+02 1.154e+03 1.779e+03 3.670e+03, threshold=2.308e+03, percent-clipped=30.0 2023-06-27 05:13:00,579 INFO [train.py:996] (3/4) Epoch 10, batch 14400, loss[loss=0.1817, simple_loss=0.2554, pruned_loss=0.05401, over 21572.00 frames. ], tot_loss[loss=0.2136, simple_loss=0.2929, pruned_loss=0.06716, over 4244796.67 frames. ], batch size: 263, lr: 2.94e-03, grad_scale: 16.0 2023-06-27 05:13:57,805 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=1733232.0, ans=0.2 2023-06-27 05:14:04,172 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1733292.0, ans=0.125 2023-06-27 05:14:33,555 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1733352.0, ans=0.0 2023-06-27 05:14:40,143 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=1733352.0, ans=0.0 2023-06-27 05:14:46,467 INFO [train.py:996] (3/4) Epoch 10, batch 14450, loss[loss=0.2181, simple_loss=0.2811, pruned_loss=0.07754, over 21621.00 frames. ], tot_loss[loss=0.2107, simple_loss=0.2874, pruned_loss=0.06699, over 4250538.53 frames. 
], batch size: 441, lr: 2.94e-03, grad_scale: 16.0 2023-06-27 05:15:21,263 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1733472.0, ans=0.125 2023-06-27 05:15:36,479 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.984e+02 5.618e+02 7.352e+02 1.088e+03 2.382e+03, threshold=1.470e+03, percent-clipped=1.0 2023-06-27 05:15:37,203 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-27 05:15:54,105 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1733592.0, ans=0.1 2023-06-27 05:16:28,031 INFO [train.py:996] (3/4) Epoch 10, batch 14500, loss[loss=0.2004, simple_loss=0.2789, pruned_loss=0.06093, over 21798.00 frames. ], tot_loss[loss=0.2094, simple_loss=0.2847, pruned_loss=0.06701, over 4253778.98 frames. ], batch size: 118, lr: 2.94e-03, grad_scale: 16.0 2023-06-27 05:16:37,297 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1733712.0, ans=0.1 2023-06-27 05:17:14,388 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.46 vs. limit=15.0 2023-06-27 05:17:17,255 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1733832.0, ans=0.125 2023-06-27 05:17:29,984 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=1733892.0, ans=0.0 2023-06-27 05:17:33,430 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1733892.0, ans=0.125 2023-06-27 05:17:48,359 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=11.31 vs. limit=15.0 2023-06-27 05:17:55,059 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1733952.0, ans=0.0 2023-06-27 05:18:01,758 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1733952.0, ans=0.125 2023-06-27 05:18:12,020 INFO [train.py:996] (3/4) Epoch 10, batch 14550, loss[loss=0.2331, simple_loss=0.3151, pruned_loss=0.07558, over 21983.00 frames. ], tot_loss[loss=0.2132, simple_loss=0.2895, pruned_loss=0.06844, over 4260204.98 frames. 
], batch size: 317, lr: 2.94e-03, grad_scale: 16.0 2023-06-27 05:19:02,694 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.284e+02 5.674e+02 7.709e+02 1.144e+03 2.600e+03, threshold=1.542e+03, percent-clipped=15.0 2023-06-27 05:19:18,052 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1734132.0, ans=0.125 2023-06-27 05:19:33,103 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-27 05:19:47,150 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1734252.0, ans=0.1 2023-06-27 05:19:52,320 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1734252.0, ans=0.125 2023-06-27 05:20:05,587 INFO [train.py:996] (3/4) Epoch 10, batch 14600, loss[loss=0.2819, simple_loss=0.3512, pruned_loss=0.1063, over 21462.00 frames. ], tot_loss[loss=0.219, simple_loss=0.2967, pruned_loss=0.07063, over 4258577.93 frames. ], batch size: 471, lr: 2.94e-03, grad_scale: 16.0 2023-06-27 05:20:09,785 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=1734312.0, ans=0.0 2023-06-27 05:20:35,131 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=1734372.0, ans=0.125 2023-06-27 05:21:47,158 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1734612.0, ans=0.125 2023-06-27 05:21:48,275 INFO [train.py:996] (3/4) Epoch 10, batch 14650, loss[loss=0.1625, simple_loss=0.2493, pruned_loss=0.03786, over 21699.00 frames. ], tot_loss[loss=0.2195, simple_loss=0.2987, pruned_loss=0.07019, over 4246106.32 frames. ], batch size: 247, lr: 2.94e-03, grad_scale: 16.0 2023-06-27 05:21:58,084 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1734612.0, ans=0.125 2023-06-27 05:22:26,651 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1734672.0, ans=0.125 2023-06-27 05:22:39,585 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 2.998e+02 5.657e+02 7.781e+02 1.109e+03 2.213e+03, threshold=1.556e+03, percent-clipped=10.0 2023-06-27 05:23:06,261 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=1734792.0, ans=0.0 2023-06-27 05:23:14,906 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1734792.0, ans=0.0 2023-06-27 05:23:30,560 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1734852.0, ans=0.125 2023-06-27 05:23:37,271 INFO [train.py:996] (3/4) Epoch 10, batch 14700, loss[loss=0.2547, simple_loss=0.3551, pruned_loss=0.07714, over 21549.00 frames. ], tot_loss[loss=0.211, simple_loss=0.2918, pruned_loss=0.06513, over 4241602.32 frames. 
], batch size: 471, lr: 2.94e-03, grad_scale: 16.0 2023-06-27 05:24:15,323 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1734972.0, ans=0.125 2023-06-27 05:24:36,605 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=1735032.0, ans=0.0 2023-06-27 05:24:51,302 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=1735092.0, ans=0.2 2023-06-27 05:25:20,421 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten.whitening_limit, batch_count=1735152.0, ans=15.0 2023-06-27 05:25:36,133 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1735152.0, ans=0.1 2023-06-27 05:25:36,552 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.02 vs. limit=10.0 2023-06-27 05:25:38,797 INFO [train.py:996] (3/4) Epoch 10, batch 14750, loss[loss=0.2415, simple_loss=0.3114, pruned_loss=0.08581, over 21455.00 frames. ], tot_loss[loss=0.2166, simple_loss=0.2974, pruned_loss=0.0679, over 4255844.97 frames. ], batch size: 194, lr: 2.94e-03, grad_scale: 16.0 2023-06-27 05:26:30,559 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.686e+02 7.000e+02 1.273e+03 1.820e+03 3.687e+03, threshold=2.546e+03, percent-clipped=36.0 2023-06-27 05:26:33,814 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=9.68 vs. limit=15.0 2023-06-27 05:26:36,401 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1735332.0, ans=0.125 2023-06-27 05:27:26,349 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1735452.0, ans=0.125 2023-06-27 05:27:29,206 INFO [train.py:996] (3/4) Epoch 10, batch 14800, loss[loss=0.2393, simple_loss=0.3067, pruned_loss=0.08595, over 21500.00 frames. ], tot_loss[loss=0.228, simple_loss=0.3087, pruned_loss=0.07361, over 4259911.30 frames. ], batch size: 441, lr: 2.94e-03, grad_scale: 32.0 2023-06-27 05:28:15,868 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.54 vs. limit=22.5 2023-06-27 05:28:27,878 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1735632.0, ans=0.125 2023-06-27 05:28:28,519 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=9.04 vs. 
limit=15.0 2023-06-27 05:28:31,759 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=1735632.0, ans=0.2 2023-06-27 05:28:48,172 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1735692.0, ans=0.0 2023-06-27 05:28:50,157 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=1735692.0, ans=0.125 2023-06-27 05:29:04,848 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer_ff3.min_abs, batch_count=1735752.0, ans=0.2 2023-06-27 05:29:29,459 INFO [train.py:996] (3/4) Epoch 10, batch 14850, loss[loss=0.1825, simple_loss=0.2535, pruned_loss=0.05581, over 21483.00 frames. ], tot_loss[loss=0.2234, simple_loss=0.3016, pruned_loss=0.07262, over 4257301.59 frames. ], batch size: 230, lr: 2.94e-03, grad_scale: 16.0 2023-06-27 05:29:35,372 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=1735812.0, ans=0.0 2023-06-27 05:29:49,374 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1735872.0, ans=0.125 2023-06-27 05:30:16,838 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.100e+02 5.316e+02 7.277e+02 1.299e+03 3.940e+03, threshold=1.455e+03, percent-clipped=5.0 2023-06-27 05:31:19,412 INFO [train.py:996] (3/4) Epoch 10, batch 14900, loss[loss=0.2161, simple_loss=0.2977, pruned_loss=0.06721, over 20012.00 frames. ], tot_loss[loss=0.2264, simple_loss=0.3045, pruned_loss=0.07415, over 4265326.57 frames. ], batch size: 703, lr: 2.94e-03, grad_scale: 16.0 2023-06-27 05:31:23,551 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1736112.0, ans=0.1 2023-06-27 05:32:00,290 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=5.48 vs. limit=15.0 2023-06-27 05:33:06,334 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1736352.0, ans=0.125 2023-06-27 05:33:11,108 INFO [train.py:996] (3/4) Epoch 10, batch 14950, loss[loss=0.2222, simple_loss=0.3133, pruned_loss=0.06559, over 21639.00 frames. ], tot_loss[loss=0.2255, simple_loss=0.3046, pruned_loss=0.07319, over 4260694.67 frames. ], batch size: 441, lr: 2.94e-03, grad_scale: 8.0 2023-06-27 05:33:11,802 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.min_abs, batch_count=1736412.0, ans=0.5 2023-06-27 05:33:39,770 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=1736472.0, ans=0.125 2023-06-27 05:33:42,042 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.75 vs. 
limit=15.0 2023-06-27 05:33:57,591 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1736532.0, ans=0.125 2023-06-27 05:34:05,491 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.924e+02 5.785e+02 8.505e+02 1.255e+03 2.502e+03, threshold=1.701e+03, percent-clipped=18.0 2023-06-27 05:34:11,269 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1736532.0, ans=0.1 2023-06-27 05:34:35,256 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.49 vs. limit=10.0 2023-06-27 05:34:48,284 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1736652.0, ans=0.1 2023-06-27 05:34:57,075 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1736652.0, ans=0.125 2023-06-27 05:35:00,010 INFO [train.py:996] (3/4) Epoch 10, batch 15000, loss[loss=0.2335, simple_loss=0.2907, pruned_loss=0.08815, over 16085.00 frames. ], tot_loss[loss=0.227, simple_loss=0.3056, pruned_loss=0.07422, over 4252259.70 frames. ], batch size: 60, lr: 2.94e-03, grad_scale: 8.0 2023-06-27 05:35:00,010 INFO [train.py:1019] (3/4) Computing validation loss 2023-06-27 05:35:19,882 INFO [train.py:1028] (3/4) Epoch 10, validation: loss=0.2554, simple_loss=0.3462, pruned_loss=0.08227, over 1796401.00 frames. 2023-06-27 05:35:19,883 INFO [train.py:1029] (3/4) Maximum memory allocated so far is 23690MB 2023-06-27 05:36:23,143 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=1736892.0, ans=0.05 2023-06-27 05:36:37,319 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1736892.0, ans=0.0 2023-06-27 05:36:46,195 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1736952.0, ans=0.1 2023-06-27 05:37:04,909 INFO [train.py:996] (3/4) Epoch 10, batch 15050, loss[loss=0.2136, simple_loss=0.2937, pruned_loss=0.06677, over 21775.00 frames. ], tot_loss[loss=0.2296, simple_loss=0.3081, pruned_loss=0.07556, over 4257547.46 frames. ], batch size: 247, lr: 2.94e-03, grad_scale: 8.0 2023-06-27 05:38:05,486 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.468e+02 6.013e+02 1.020e+03 1.764e+03 3.653e+03, threshold=2.041e+03, percent-clipped=28.0 2023-06-27 05:38:14,348 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=1737192.0, ans=0.2 2023-06-27 05:38:18,035 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=1737192.0, ans=0.015 2023-06-27 05:38:59,241 INFO [train.py:996] (3/4) Epoch 10, batch 15100, loss[loss=0.2311, simple_loss=0.3066, pruned_loss=0.07779, over 21445.00 frames. ], tot_loss[loss=0.23, simple_loss=0.3104, pruned_loss=0.07479, over 4256511.26 frames. 
], batch size: 211, lr: 2.94e-03, grad_scale: 8.0 2023-06-27 05:39:01,637 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=1737312.0, ans=0.2 2023-06-27 05:39:03,732 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=1737312.0, ans=0.0 2023-06-27 05:39:10,541 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1737312.0, ans=0.125 2023-06-27 05:39:25,715 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=1737372.0, ans=0.125 2023-06-27 05:39:57,589 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=1737432.0, ans=0.2 2023-06-27 05:39:59,178 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1737432.0, ans=0.125 2023-06-27 05:40:48,200 INFO [train.py:996] (3/4) Epoch 10, batch 15150, loss[loss=0.2052, simple_loss=0.2714, pruned_loss=0.06945, over 21149.00 frames. ], tot_loss[loss=0.2277, simple_loss=0.3063, pruned_loss=0.07456, over 4253498.09 frames. ], batch size: 143, lr: 2.94e-03, grad_scale: 8.0 2023-06-27 05:40:52,319 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=1737612.0, ans=0.0 2023-06-27 05:41:25,790 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1737672.0, ans=0.125 2023-06-27 05:41:42,589 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.241e+02 5.996e+02 8.329e+02 1.455e+03 4.229e+03, threshold=1.666e+03, percent-clipped=17.0 2023-06-27 05:41:50,423 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1737792.0, ans=0.0 2023-06-27 05:42:01,044 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1737792.0, ans=0.0 2023-06-27 05:42:36,435 INFO [train.py:996] (3/4) Epoch 10, batch 15200, loss[loss=0.2106, simple_loss=0.3039, pruned_loss=0.05871, over 21546.00 frames. ], tot_loss[loss=0.2185, simple_loss=0.2961, pruned_loss=0.07041, over 4257310.83 frames. ], batch size: 441, lr: 2.94e-03, grad_scale: 16.0 2023-06-27 05:43:06,876 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1737972.0, ans=0.1 2023-06-27 05:44:22,673 INFO [train.py:996] (3/4) Epoch 10, batch 15250, loss[loss=0.1963, simple_loss=0.267, pruned_loss=0.06282, over 21474.00 frames. ], tot_loss[loss=0.2142, simple_loss=0.2907, pruned_loss=0.06887, over 4261926.76 frames. ], batch size: 212, lr: 2.94e-03, grad_scale: 16.0 2023-06-27 05:44:56,556 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1738272.0, ans=0.0 2023-06-27 05:45:03,756 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=1738272.0, ans=0.025 2023-06-27 05:45:04,215 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.19 vs. 
limit=6.0 2023-06-27 05:45:16,768 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.911e+02 6.076e+02 9.164e+02 1.527e+03 3.060e+03, threshold=1.833e+03, percent-clipped=16.0 2023-06-27 05:45:50,741 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1738392.0, ans=0.125 2023-06-27 05:46:11,052 INFO [train.py:996] (3/4) Epoch 10, batch 15300, loss[loss=0.2266, simple_loss=0.3053, pruned_loss=0.07388, over 21447.00 frames. ], tot_loss[loss=0.2183, simple_loss=0.2939, pruned_loss=0.07134, over 4263909.20 frames. ], batch size: 211, lr: 2.93e-03, grad_scale: 16.0 2023-06-27 05:47:06,464 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=5.70 vs. limit=12.0 2023-06-27 05:47:56,086 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1738752.0, ans=0.125 2023-06-27 05:47:56,095 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=1738752.0, ans=0.125 2023-06-27 05:47:58,645 INFO [train.py:996] (3/4) Epoch 10, batch 15350, loss[loss=0.2057, simple_loss=0.2959, pruned_loss=0.05775, over 21449.00 frames. ], tot_loss[loss=0.2229, simple_loss=0.2998, pruned_loss=0.07305, over 4258004.23 frames. ], batch size: 194, lr: 2.93e-03, grad_scale: 16.0 2023-06-27 05:48:10,323 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=1738812.0, ans=0.0 2023-06-27 05:48:11,797 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=1738812.0, ans=0.2 2023-06-27 05:48:48,541 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1738932.0, ans=0.125 2023-06-27 05:48:51,334 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.081e+02 6.656e+02 9.808e+02 1.431e+03 3.197e+03, threshold=1.962e+03, percent-clipped=8.0 2023-06-27 05:49:36,590 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-27 05:49:45,873 INFO [train.py:996] (3/4) Epoch 10, batch 15400, loss[loss=0.2175, simple_loss=0.2865, pruned_loss=0.07421, over 21311.00 frames. ], tot_loss[loss=0.222, simple_loss=0.3006, pruned_loss=0.07175, over 4273029.87 frames. 
], batch size: 159, lr: 2.93e-03, grad_scale: 16.0 2023-06-27 05:50:35,242 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1739232.0, ans=0.125 2023-06-27 05:50:47,012 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=1739292.0, ans=0.0 2023-06-27 05:50:51,881 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=1739292.0, ans=0.2 2023-06-27 05:51:00,877 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1739292.0, ans=0.0 2023-06-27 05:51:02,611 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=1739292.0, ans=0.05 2023-06-27 05:51:32,775 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1739412.0, ans=0.125 2023-06-27 05:51:33,734 INFO [train.py:996] (3/4) Epoch 10, batch 15450, loss[loss=0.194, simple_loss=0.294, pruned_loss=0.047, over 21788.00 frames. ], tot_loss[loss=0.2204, simple_loss=0.2983, pruned_loss=0.07124, over 4278997.96 frames. ], batch size: 298, lr: 2.93e-03, grad_scale: 16.0 2023-06-27 05:52:17,964 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1739532.0, ans=0.125 2023-06-27 05:52:28,017 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.164e+02 6.328e+02 9.249e+02 1.410e+03 2.980e+03, threshold=1.850e+03, percent-clipped=8.0 2023-06-27 05:52:56,484 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=11.82 vs. limit=15.0 2023-06-27 05:52:56,738 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=11.95 vs. limit=22.5 2023-06-27 05:53:17,092 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=1739652.0, ans=0.0 2023-06-27 05:53:20,819 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1739652.0, ans=0.125 2023-06-27 05:53:29,065 INFO [train.py:996] (3/4) Epoch 10, batch 15500, loss[loss=0.2326, simple_loss=0.3172, pruned_loss=0.07401, over 21826.00 frames. ], tot_loss[loss=0.2218, simple_loss=0.3002, pruned_loss=0.0717, over 4259629.20 frames. ], batch size: 282, lr: 2.93e-03, grad_scale: 16.0 2023-06-27 05:55:09,044 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1739952.0, ans=0.125 2023-06-27 05:55:23,932 INFO [train.py:996] (3/4) Epoch 10, batch 15550, loss[loss=0.2206, simple_loss=0.3054, pruned_loss=0.06792, over 21643.00 frames. ], tot_loss[loss=0.2181, simple_loss=0.2976, pruned_loss=0.0693, over 4263873.20 frames. 
], batch size: 414, lr: 2.93e-03, grad_scale: 16.0 2023-06-27 05:55:33,076 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1740012.0, ans=0.1 2023-06-27 05:55:48,577 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=1740072.0, ans=0.0 2023-06-27 05:56:17,358 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.325e+02 6.960e+02 1.269e+03 1.845e+03 3.300e+03, threshold=2.538e+03, percent-clipped=23.0 2023-06-27 05:56:33,807 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1740192.0, ans=0.125 2023-06-27 05:56:36,526 INFO [scaling.py:962] (3/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=5.98 vs. limit=8.0 2023-06-27 05:56:43,185 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.31 vs. limit=15.0 2023-06-27 05:56:56,047 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.95 vs. limit=15.0 2023-06-27 05:57:11,150 INFO [train.py:996] (3/4) Epoch 10, batch 15600, loss[loss=0.1713, simple_loss=0.2393, pruned_loss=0.05167, over 21510.00 frames. ], tot_loss[loss=0.2136, simple_loss=0.2909, pruned_loss=0.06812, over 4255601.08 frames. ], batch size: 230, lr: 2.93e-03, grad_scale: 32.0 2023-06-27 05:58:27,397 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.66 vs. limit=22.5 2023-06-27 05:58:59,224 INFO [train.py:996] (3/4) Epoch 10, batch 15650, loss[loss=0.2093, simple_loss=0.276, pruned_loss=0.0713, over 21300.00 frames. ], tot_loss[loss=0.213, simple_loss=0.29, pruned_loss=0.06799, over 4263617.92 frames. ], batch size: 144, lr: 2.93e-03, grad_scale: 16.0 2023-06-27 05:59:15,572 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1740672.0, ans=0.125 2023-06-27 05:59:49,308 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.999e+02 5.253e+02 8.016e+02 1.068e+03 2.204e+03, threshold=1.603e+03, percent-clipped=0.0 2023-06-27 06:00:38,534 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=1740852.0, ans=0.0 2023-06-27 06:00:41,573 INFO [train.py:996] (3/4) Epoch 10, batch 15700, loss[loss=0.1984, simple_loss=0.2948, pruned_loss=0.05104, over 21190.00 frames. ], tot_loss[loss=0.2096, simple_loss=0.2859, pruned_loss=0.06662, over 4255824.58 frames. ], batch size: 549, lr: 2.93e-03, grad_scale: 16.0 2023-06-27 06:00:56,212 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=1740912.0, ans=0.0 2023-06-27 06:01:26,005 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=1741032.0, ans=0.125 2023-06-27 06:01:36,769 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.74 vs. 
limit=15.0 2023-06-27 06:02:22,214 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=1741152.0, ans=0.035 2023-06-27 06:02:28,423 INFO [train.py:996] (3/4) Epoch 10, batch 15750, loss[loss=0.2417, simple_loss=0.3067, pruned_loss=0.08829, over 21858.00 frames. ], tot_loss[loss=0.2077, simple_loss=0.2826, pruned_loss=0.06644, over 4251320.77 frames. ], batch size: 98, lr: 2.93e-03, grad_scale: 16.0 2023-06-27 06:03:22,675 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.061e+02 5.702e+02 8.249e+02 1.125e+03 2.008e+03, threshold=1.650e+03, percent-clipped=7.0 2023-06-27 06:03:34,374 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=10.11 vs. limit=15.0 2023-06-27 06:04:08,313 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1741452.0, ans=0.125 2023-06-27 06:04:13,167 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=1741512.0, ans=0.2 2023-06-27 06:04:14,228 INFO [train.py:996] (3/4) Epoch 10, batch 15800, loss[loss=0.1776, simple_loss=0.2485, pruned_loss=0.0534, over 21373.00 frames. ], tot_loss[loss=0.2043, simple_loss=0.2779, pruned_loss=0.06531, over 4251487.63 frames. ], batch size: 211, lr: 2.93e-03, grad_scale: 16.0 2023-06-27 06:04:14,677 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=1741512.0, ans=0.035 2023-06-27 06:04:52,546 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=1741632.0, ans=0.2 2023-06-27 06:05:01,716 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=9.92 vs. limit=22.5 2023-06-27 06:05:13,391 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=1741632.0, ans=0.5 2023-06-27 06:05:51,879 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=6.76 vs. limit=15.0 2023-06-27 06:05:59,716 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=1741812.0, ans=0.125 2023-06-27 06:06:00,813 INFO [train.py:996] (3/4) Epoch 10, batch 15850, loss[loss=0.2185, simple_loss=0.2904, pruned_loss=0.07329, over 21815.00 frames. ], tot_loss[loss=0.209, simple_loss=0.2825, pruned_loss=0.06775, over 4248020.74 frames. ], batch size: 118, lr: 2.93e-03, grad_scale: 8.0 2023-06-27 06:06:48,033 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1741932.0, ans=0.125 2023-06-27 06:06:56,864 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1741932.0, ans=0.125 2023-06-27 06:06:57,726 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.205e+02 6.570e+02 8.492e+02 1.187e+03 2.613e+03, threshold=1.698e+03, percent-clipped=9.0 2023-06-27 06:07:47,407 INFO [train.py:996] (3/4) Epoch 10, batch 15900, loss[loss=0.2033, simple_loss=0.2839, pruned_loss=0.06131, over 21750.00 frames. ], tot_loss[loss=0.2111, simple_loss=0.2845, pruned_loss=0.06885, over 4247183.34 frames. 
], batch size: 124, lr: 2.93e-03, grad_scale: 8.0 2023-06-27 06:07:52,788 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=1742112.0, ans=0.95 2023-06-27 06:09:33,388 INFO [train.py:996] (3/4) Epoch 10, batch 15950, loss[loss=0.2533, simple_loss=0.3344, pruned_loss=0.08607, over 21685.00 frames. ], tot_loss[loss=0.2085, simple_loss=0.2854, pruned_loss=0.06576, over 4241785.42 frames. ], batch size: 441, lr: 2.93e-03, grad_scale: 8.0 2023-06-27 06:10:20,179 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1742532.0, ans=0.1 2023-06-27 06:10:31,658 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.565e+02 5.245e+02 8.616e+02 1.211e+03 4.191e+03, threshold=1.723e+03, percent-clipped=6.0 2023-06-27 06:10:37,526 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1742592.0, ans=0.125 2023-06-27 06:10:50,968 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=1742592.0, ans=0.1 2023-06-27 06:11:21,898 INFO [train.py:996] (3/4) Epoch 10, batch 16000, loss[loss=0.2099, simple_loss=0.3041, pruned_loss=0.05785, over 21788.00 frames. ], tot_loss[loss=0.2057, simple_loss=0.2843, pruned_loss=0.0636, over 4253607.62 frames. ], batch size: 351, lr: 2.93e-03, grad_scale: 16.0 2023-06-27 06:11:38,360 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1742772.0, ans=0.0 2023-06-27 06:11:45,215 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=1742772.0, ans=0.125 2023-06-27 06:12:24,369 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=1742892.0, ans=0.0 2023-06-27 06:12:48,592 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=1742952.0, ans=0.05 2023-06-27 06:13:10,602 INFO [train.py:996] (3/4) Epoch 10, batch 16050, loss[loss=0.2254, simple_loss=0.3226, pruned_loss=0.06409, over 21651.00 frames. ], tot_loss[loss=0.2045, simple_loss=0.2854, pruned_loss=0.06181, over 4264186.10 frames. ], batch size: 230, lr: 2.93e-03, grad_scale: 16.0 2023-06-27 06:13:14,619 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-27 06:13:26,970 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1743072.0, ans=0.125 2023-06-27 06:13:44,155 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=9.17 vs. 
limit=15.0 2023-06-27 06:14:07,171 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.055e+02 6.829e+02 9.641e+02 1.432e+03 3.603e+03, threshold=1.928e+03, percent-clipped=16.0 2023-06-27 06:14:07,767 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1743132.0, ans=0.125 2023-06-27 06:14:09,962 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.whiten.whitening_limit, batch_count=1743132.0, ans=12.0 2023-06-27 06:14:21,344 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1743192.0, ans=0.125 2023-06-27 06:14:24,610 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=1743192.0, ans=0.125 2023-06-27 06:14:35,574 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=10.17 vs. limit=15.0 2023-06-27 06:14:39,755 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1743252.0, ans=0.125 2023-06-27 06:14:48,837 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1743252.0, ans=0.1 2023-06-27 06:14:51,660 INFO [train.py:996] (3/4) Epoch 10, batch 16100, loss[loss=0.2567, simple_loss=0.3084, pruned_loss=0.1025, over 21823.00 frames. ], tot_loss[loss=0.2079, simple_loss=0.2887, pruned_loss=0.06352, over 4267936.32 frames. ], batch size: 508, lr: 2.93e-03, grad_scale: 16.0 2023-06-27 06:15:58,650 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=1743492.0, ans=0.2 2023-06-27 06:16:21,383 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1743552.0, ans=0.125 2023-06-27 06:16:27,666 INFO [train.py:996] (3/4) Epoch 10, batch 16150, loss[loss=0.2011, simple_loss=0.2841, pruned_loss=0.05911, over 21827.00 frames. ], tot_loss[loss=0.2095, simple_loss=0.2876, pruned_loss=0.06572, over 4281060.68 frames. ], batch size: 298, lr: 2.93e-03, grad_scale: 16.0 2023-06-27 06:16:42,577 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=1743612.0, ans=0.125 2023-06-27 06:16:54,696 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=1743672.0, ans=0.0 2023-06-27 06:17:12,267 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=1743732.0, ans=0.125 2023-06-27 06:17:20,155 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=10.73 vs. 
limit=15.0 2023-06-27 06:17:36,691 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.965e+02 5.965e+02 7.575e+02 1.164e+03 3.405e+03, threshold=1.515e+03, percent-clipped=4.0 2023-06-27 06:17:55,066 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1743792.0, ans=0.125 2023-06-27 06:17:58,204 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=1743852.0, ans=0.025 2023-06-27 06:18:27,510 INFO [train.py:996] (3/4) Epoch 10, batch 16200, loss[loss=0.2443, simple_loss=0.3257, pruned_loss=0.08148, over 21449.00 frames. ], tot_loss[loss=0.214, simple_loss=0.2926, pruned_loss=0.06769, over 4282413.59 frames. ], batch size: 211, lr: 2.93e-03, grad_scale: 16.0 2023-06-27 06:19:30,836 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.98 vs. limit=15.0 2023-06-27 06:20:06,790 INFO [scaling.py:962] (3/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=6.11 vs. limit=8.0 2023-06-27 06:20:13,804 INFO [train.py:996] (3/4) Epoch 10, batch 16250, loss[loss=0.167, simple_loss=0.2398, pruned_loss=0.04708, over 21185.00 frames. ], tot_loss[loss=0.2152, simple_loss=0.2942, pruned_loss=0.06812, over 4282127.16 frames. ], batch size: 176, lr: 2.93e-03, grad_scale: 16.0 2023-06-27 06:20:14,539 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=1744212.0, ans=0.2 2023-06-27 06:20:19,669 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-27 06:20:22,950 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1744212.0, ans=0.0 2023-06-27 06:20:51,661 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.39 vs. limit=15.0 2023-06-27 06:21:10,957 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.705e+02 5.225e+02 6.820e+02 1.048e+03 2.777e+03, threshold=1.364e+03, percent-clipped=10.0 2023-06-27 06:21:18,985 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=8.93 vs. limit=15.0 2023-06-27 06:22:00,271 INFO [train.py:996] (3/4) Epoch 10, batch 16300, loss[loss=0.1865, simple_loss=0.2556, pruned_loss=0.05867, over 21764.00 frames. ], tot_loss[loss=0.2081, simple_loss=0.2877, pruned_loss=0.06423, over 4266507.67 frames. ], batch size: 112, lr: 2.93e-03, grad_scale: 16.0 2023-06-27 06:22:38,614 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1744572.0, ans=0.0 2023-06-27 06:22:47,592 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.70 vs. 
limit=12.0 2023-06-27 06:22:57,233 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1744632.0, ans=0.2 2023-06-27 06:23:01,042 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1744692.0, ans=0.125 2023-06-27 06:23:34,080 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1744752.0, ans=0.1 2023-06-27 06:23:41,668 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=1744752.0, ans=0.125 2023-06-27 06:23:48,222 INFO [train.py:996] (3/4) Epoch 10, batch 16350, loss[loss=0.2395, simple_loss=0.312, pruned_loss=0.08348, over 21651.00 frames. ], tot_loss[loss=0.2088, simple_loss=0.2878, pruned_loss=0.06495, over 4257092.83 frames. ], batch size: 351, lr: 2.93e-03, grad_scale: 16.0 2023-06-27 06:24:05,890 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=1744812.0, ans=0.07 2023-06-27 06:24:16,341 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1744872.0, ans=0.0 2023-06-27 06:24:30,364 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1744872.0, ans=0.125 2023-06-27 06:24:39,905 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.02 vs. limit=15.0 2023-06-27 06:24:45,631 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.899e+02 6.082e+02 8.252e+02 1.130e+03 2.497e+03, threshold=1.650e+03, percent-clipped=10.0 2023-06-27 06:25:05,524 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1744992.0, ans=0.125 2023-06-27 06:25:26,636 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.10 vs. limit=10.0 2023-06-27 06:25:35,478 INFO [train.py:996] (3/4) Epoch 10, batch 16400, loss[loss=0.2091, simple_loss=0.2879, pruned_loss=0.06517, over 21820.00 frames. ], tot_loss[loss=0.2136, simple_loss=0.2936, pruned_loss=0.06686, over 4264180.77 frames. ], batch size: 298, lr: 2.93e-03, grad_scale: 32.0 2023-06-27 06:26:07,814 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.38 vs. limit=15.0 2023-06-27 06:26:24,289 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=1745232.0, ans=0.04949747468305833 2023-06-27 06:26:49,464 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1745292.0, ans=0.1 2023-06-27 06:27:22,341 INFO [train.py:996] (3/4) Epoch 10, batch 16450, loss[loss=0.2413, simple_loss=0.3025, pruned_loss=0.09008, over 21625.00 frames. ], tot_loss[loss=0.2146, simple_loss=0.2932, pruned_loss=0.068, over 4266520.34 frames. 
], batch size: 471, lr: 2.93e-03, grad_scale: 32.0 2023-06-27 06:27:23,043 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1745412.0, ans=0.125 2023-06-27 06:27:31,440 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1745412.0, ans=0.125 2023-06-27 06:27:36,395 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1745412.0, ans=0.0 2023-06-27 06:28:19,529 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.272e+02 6.597e+02 9.235e+02 1.601e+03 3.322e+03, threshold=1.847e+03, percent-clipped=22.0 2023-06-27 06:29:15,254 INFO [train.py:996] (3/4) Epoch 10, batch 16500, loss[loss=0.1697, simple_loss=0.2276, pruned_loss=0.05586, over 21340.00 frames. ], tot_loss[loss=0.2135, simple_loss=0.2912, pruned_loss=0.06788, over 4263797.70 frames. ], batch size: 131, lr: 2.93e-03, grad_scale: 16.0 2023-06-27 06:29:40,642 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1745772.0, ans=0.1 2023-06-27 06:29:54,756 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1745832.0, ans=0.0 2023-06-27 06:31:10,032 INFO [train.py:996] (3/4) Epoch 10, batch 16550, loss[loss=0.2507, simple_loss=0.3331, pruned_loss=0.08413, over 21462.00 frames. ], tot_loss[loss=0.2115, simple_loss=0.291, pruned_loss=0.06595, over 4266191.09 frames. ], batch size: 471, lr: 2.93e-03, grad_scale: 16.0 2023-06-27 06:31:30,950 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1746072.0, ans=0.125 2023-06-27 06:31:30,999 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1746072.0, ans=0.1 2023-06-27 06:31:35,285 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.92 vs. limit=15.0 2023-06-27 06:31:57,137 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.55 vs. limit=15.0 2023-06-27 06:32:00,297 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=1746132.0, ans=0.07 2023-06-27 06:32:01,867 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1746132.0, ans=0.1 2023-06-27 06:32:04,176 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.51 vs. limit=22.5 2023-06-27 06:32:08,597 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=1746132.0, ans=0.2 2023-06-27 06:32:11,786 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.278e+02 6.354e+02 1.023e+03 1.715e+03 3.969e+03, threshold=2.045e+03, percent-clipped=20.0 2023-06-27 06:32:45,671 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.84 vs. 
limit=6.0 2023-06-27 06:33:01,725 INFO [train.py:996] (3/4) Epoch 10, batch 16600, loss[loss=0.2522, simple_loss=0.3494, pruned_loss=0.07755, over 21367.00 frames. ], tot_loss[loss=0.2196, simple_loss=0.3001, pruned_loss=0.06959, over 4271584.82 frames. ], batch size: 211, lr: 2.93e-03, grad_scale: 16.0 2023-06-27 06:33:08,243 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1746312.0, ans=0.1 2023-06-27 06:33:15,543 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.05 vs. limit=15.0 2023-06-27 06:33:15,602 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.02 vs. limit=6.0 2023-06-27 06:33:32,637 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1746372.0, ans=0.0 2023-06-27 06:33:43,143 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1746372.0, ans=0.125 2023-06-27 06:34:50,933 INFO [train.py:996] (3/4) Epoch 10, batch 16650, loss[loss=0.2474, simple_loss=0.3221, pruned_loss=0.08635, over 21469.00 frames. ], tot_loss[loss=0.2246, simple_loss=0.3061, pruned_loss=0.07159, over 4267157.85 frames. ], batch size: 194, lr: 2.93e-03, grad_scale: 16.0 2023-06-27 06:35:31,916 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=1746672.0, ans=0.5 2023-06-27 06:35:58,140 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.748e+02 7.097e+02 9.518e+02 1.581e+03 3.619e+03, threshold=1.904e+03, percent-clipped=14.0 2023-06-27 06:36:00,538 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1746792.0, ans=0.1 2023-06-27 06:36:02,791 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.66 vs. limit=15.0 2023-06-27 06:36:04,332 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1746792.0, ans=0.1 2023-06-27 06:36:47,692 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1746912.0, ans=0.1 2023-06-27 06:36:48,635 INFO [train.py:996] (3/4) Epoch 10, batch 16700, loss[loss=0.2354, simple_loss=0.3192, pruned_loss=0.07576, over 21911.00 frames. ], tot_loss[loss=0.2251, simple_loss=0.3066, pruned_loss=0.07184, over 4270534.97 frames. ], batch size: 372, lr: 2.93e-03, grad_scale: 16.0 2023-06-27 06:37:12,673 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.29 vs. 
limit=15.0 2023-06-27 06:37:40,896 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=1747032.0, ans=0.125 2023-06-27 06:37:53,501 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=1747092.0, ans=0.0 2023-06-27 06:38:04,493 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1747092.0, ans=0.2 2023-06-27 06:38:12,178 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.12 vs. limit=15.0 2023-06-27 06:38:15,026 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=1747092.0, ans=0.125 2023-06-27 06:38:17,200 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-27 06:38:24,624 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=1747152.0, ans=0.07 2023-06-27 06:38:30,096 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1747152.0, ans=0.1 2023-06-27 06:38:41,222 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1747152.0, ans=0.125 2023-06-27 06:38:46,413 INFO [train.py:996] (3/4) Epoch 10, batch 16750, loss[loss=0.2059, simple_loss=0.2527, pruned_loss=0.07954, over 20052.00 frames. ], tot_loss[loss=0.228, simple_loss=0.308, pruned_loss=0.074, over 4269822.37 frames. ], batch size: 704, lr: 2.93e-03, grad_scale: 16.0 2023-06-27 06:39:11,861 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.64 vs. limit=15.0 2023-06-27 06:39:53,241 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.731e+02 7.125e+02 1.124e+03 1.580e+03 3.763e+03, threshold=2.248e+03, percent-clipped=17.0 2023-06-27 06:40:12,616 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1747392.0, ans=0.0 2023-06-27 06:40:15,306 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.94 vs. limit=6.0 2023-06-27 06:40:20,319 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.24 vs. limit=15.0 2023-06-27 06:40:40,770 INFO [train.py:996] (3/4) Epoch 10, batch 16800, loss[loss=0.22, simple_loss=0.2686, pruned_loss=0.08567, over 20044.00 frames. ], tot_loss[loss=0.2287, simple_loss=0.3102, pruned_loss=0.0736, over 4256032.60 frames. ], batch size: 702, lr: 2.93e-03, grad_scale: 32.0 2023-06-27 06:40:44,903 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=1747512.0, ans=0.125 2023-06-27 06:40:58,504 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=1747572.0, ans=0.0 2023-06-27 06:41:12,724 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.36 vs. 
limit=12.0 2023-06-27 06:41:12,742 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.81 vs. limit=15.0 2023-06-27 06:41:41,323 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1747692.0, ans=0.125 2023-06-27 06:42:26,693 INFO [train.py:996] (3/4) Epoch 10, batch 16850, loss[loss=0.2153, simple_loss=0.2896, pruned_loss=0.07054, over 21372.00 frames. ], tot_loss[loss=0.2272, simple_loss=0.3066, pruned_loss=0.07388, over 4269367.55 frames. ], batch size: 159, lr: 2.93e-03, grad_scale: 16.0 2023-06-27 06:42:35,807 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=1747812.0, ans=0.0 2023-06-27 06:43:06,879 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1747932.0, ans=0.1 2023-06-27 06:43:16,446 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=13.41 vs. limit=15.0 2023-06-27 06:43:27,405 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.310e+02 6.690e+02 9.145e+02 1.519e+03 3.869e+03, threshold=1.829e+03, percent-clipped=12.0 2023-06-27 06:44:12,644 INFO [train.py:996] (3/4) Epoch 10, batch 16900, loss[loss=0.2387, simple_loss=0.352, pruned_loss=0.06275, over 20789.00 frames. ], tot_loss[loss=0.2237, simple_loss=0.3024, pruned_loss=0.0725, over 4271997.23 frames. ], batch size: 607, lr: 2.93e-03, grad_scale: 16.0 2023-06-27 06:44:32,464 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.50 vs. limit=6.0 2023-06-27 06:44:59,790 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1748232.0, ans=0.125 2023-06-27 06:45:25,486 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.70 vs. limit=15.0 2023-06-27 06:45:59,640 INFO [train.py:996] (3/4) Epoch 10, batch 16950, loss[loss=0.2128, simple_loss=0.2841, pruned_loss=0.07079, over 21258.00 frames. ], tot_loss[loss=0.2194, simple_loss=0.2965, pruned_loss=0.07109, over 4271461.87 frames. ], batch size: 159, lr: 2.93e-03, grad_scale: 16.0 2023-06-27 06:47:00,160 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.671e+02 6.123e+02 1.009e+03 1.392e+03 3.065e+03, threshold=2.019e+03, percent-clipped=11.0 2023-06-27 06:47:10,810 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=1748592.0, ans=0.0 2023-06-27 06:47:47,008 INFO [train.py:996] (3/4) Epoch 10, batch 17000, loss[loss=0.2003, simple_loss=0.2746, pruned_loss=0.06294, over 21956.00 frames. ], tot_loss[loss=0.2185, simple_loss=0.2946, pruned_loss=0.07122, over 4273589.47 frames. ], batch size: 316, lr: 2.93e-03, grad_scale: 16.0 2023-06-27 06:48:16,309 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.43 vs. 
limit=6.0 2023-06-27 06:48:26,525 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=1748772.0, ans=0.0 2023-06-27 06:49:22,258 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1748952.0, ans=0.125 2023-06-27 06:49:35,360 INFO [train.py:996] (3/4) Epoch 10, batch 17050, loss[loss=0.2902, simple_loss=0.3845, pruned_loss=0.09792, over 21680.00 frames. ], tot_loss[loss=0.2244, simple_loss=0.3027, pruned_loss=0.07303, over 4270286.19 frames. ], batch size: 414, lr: 2.93e-03, grad_scale: 16.0 2023-06-27 06:49:36,099 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1749012.0, ans=0.125 2023-06-27 06:49:36,666 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=5.87 vs. limit=15.0 2023-06-27 06:50:21,910 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=1749132.0, ans=0.2 2023-06-27 06:50:39,791 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.381e+02 7.817e+02 1.217e+03 1.816e+03 4.089e+03, threshold=2.434e+03, percent-clipped=19.0 2023-06-27 06:50:45,221 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-27 06:51:08,268 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1749252.0, ans=0.1 2023-06-27 06:51:20,583 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.25 vs. limit=15.0 2023-06-27 06:51:20,971 INFO [train.py:996] (3/4) Epoch 10, batch 17100, loss[loss=0.2288, simple_loss=0.2993, pruned_loss=0.07917, over 21314.00 frames. ], tot_loss[loss=0.226, simple_loss=0.3036, pruned_loss=0.07419, over 4277008.12 frames. ], batch size: 143, lr: 2.93e-03, grad_scale: 16.0 2023-06-27 06:51:21,731 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1749312.0, ans=0.0 2023-06-27 06:51:23,231 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1749312.0, ans=0.125 2023-06-27 06:51:45,531 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1749372.0, ans=0.0 2023-06-27 06:52:06,611 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1749432.0, ans=0.125 2023-06-27 06:52:20,357 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=1749432.0, ans=0.07 2023-06-27 06:52:25,295 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=1749432.0, ans=0.04949747468305833 2023-06-27 06:53:07,971 INFO [train.py:996] (3/4) Epoch 10, batch 17150, loss[loss=0.1971, simple_loss=0.2756, pruned_loss=0.0593, over 21812.00 frames. ], tot_loss[loss=0.2231, simple_loss=0.2989, pruned_loss=0.07368, over 4287987.52 frames. 
], batch size: 112, lr: 2.93e-03, grad_scale: 16.0 2023-06-27 06:53:11,707 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=1749612.0, ans=0.0 2023-06-27 06:53:34,371 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1749672.0, ans=0.1 2023-06-27 06:53:45,290 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=1749672.0, ans=0.125 2023-06-27 06:53:59,359 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-27 06:54:04,615 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1749732.0, ans=0.125 2023-06-27 06:54:16,434 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.998e+02 6.290e+02 9.886e+02 1.236e+03 2.278e+03, threshold=1.977e+03, percent-clipped=0.0 2023-06-27 06:55:01,802 INFO [train.py:996] (3/4) Epoch 10, batch 17200, loss[loss=0.2797, simple_loss=0.3381, pruned_loss=0.1107, over 21310.00 frames. ], tot_loss[loss=0.2217, simple_loss=0.2977, pruned_loss=0.0728, over 4288595.07 frames. ], batch size: 507, lr: 2.93e-03, grad_scale: 16.0 2023-06-27 06:55:06,384 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1749912.0, ans=0.125 2023-06-27 06:55:52,451 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1750032.0, ans=0.125 2023-06-27 06:56:44,911 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=1750152.0, ans=0.025 2023-06-27 06:56:56,957 INFO [train.py:996] (3/4) Epoch 10, batch 17250, loss[loss=0.2541, simple_loss=0.3351, pruned_loss=0.08653, over 21929.00 frames. ], tot_loss[loss=0.2232, simple_loss=0.2993, pruned_loss=0.07358, over 4284538.88 frames. ], batch size: 317, lr: 2.93e-03, grad_scale: 16.0 2023-06-27 06:56:59,297 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1750212.0, ans=0.1 2023-06-27 06:57:45,609 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=10.87 vs. limit=15.0 2023-06-27 06:57:53,805 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=1750332.0, ans=0.0 2023-06-27 06:57:57,199 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1750332.0, ans=0.125 2023-06-27 06:58:00,053 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.260e+02 7.026e+02 1.059e+03 1.492e+03 2.502e+03, threshold=2.118e+03, percent-clipped=5.0 2023-06-27 06:58:50,669 INFO [train.py:996] (3/4) Epoch 10, batch 17300, loss[loss=0.2789, simple_loss=0.3416, pruned_loss=0.1081, over 21427.00 frames. ], tot_loss[loss=0.2308, simple_loss=0.3073, pruned_loss=0.07719, over 4287265.05 frames. 
], batch size: 159, lr: 2.92e-03, grad_scale: 16.0 2023-06-27 06:59:12,563 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=1750572.0, ans=0.2 2023-06-27 06:59:20,130 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=6.75 vs. limit=22.5 2023-06-27 06:59:56,718 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1750692.0, ans=0.1 2023-06-27 07:00:39,973 INFO [train.py:996] (3/4) Epoch 10, batch 17350, loss[loss=0.191, simple_loss=0.2774, pruned_loss=0.05231, over 21375.00 frames. ], tot_loss[loss=0.2299, simple_loss=0.308, pruned_loss=0.07595, over 4287688.55 frames. ], batch size: 211, lr: 2.92e-03, grad_scale: 16.0 2023-06-27 07:00:55,243 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=1750812.0, ans=0.1 2023-06-27 07:01:09,997 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.44 vs. limit=15.0 2023-06-27 07:01:35,224 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1750932.0, ans=0.125 2023-06-27 07:01:43,499 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.572e+02 6.288e+02 8.971e+02 1.269e+03 2.386e+03, threshold=1.794e+03, percent-clipped=3.0 2023-06-27 07:02:08,071 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=1750992.0, ans=0.0 2023-06-27 07:02:11,512 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=1751052.0, ans=0.0 2023-06-27 07:02:19,970 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1751052.0, ans=0.125 2023-06-27 07:02:35,912 INFO [train.py:996] (3/4) Epoch 10, batch 17400, loss[loss=0.2508, simple_loss=0.3383, pruned_loss=0.08164, over 21626.00 frames. ], tot_loss[loss=0.2244, simple_loss=0.3039, pruned_loss=0.0724, over 4280033.44 frames. ], batch size: 441, lr: 2.92e-03, grad_scale: 16.0 2023-06-27 07:02:59,698 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=1751172.0, ans=0.0 2023-06-27 07:03:17,608 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.42 vs. limit=15.0 2023-06-27 07:03:43,385 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1751292.0, ans=0.125 2023-06-27 07:03:45,108 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1751292.0, ans=0.0 2023-06-27 07:03:45,646 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.90 vs. limit=15.0 2023-06-27 07:04:24,533 INFO [train.py:996] (3/4) Epoch 10, batch 17450, loss[loss=0.2252, simple_loss=0.3012, pruned_loss=0.07456, over 20572.00 frames. ], tot_loss[loss=0.2199, simple_loss=0.2989, pruned_loss=0.07041, over 4272561.69 frames. 
], batch size: 607, lr: 2.92e-03, grad_scale: 16.0 2023-06-27 07:04:40,599 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=1751472.0, ans=0.0 2023-06-27 07:04:45,505 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=1751472.0, ans=0.0 2023-06-27 07:05:31,485 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.811e+02 5.744e+02 7.670e+02 1.157e+03 3.080e+03, threshold=1.534e+03, percent-clipped=10.0 2023-06-27 07:06:02,088 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=1751652.0, ans=0.0 2023-06-27 07:06:11,705 INFO [train.py:996] (3/4) Epoch 10, batch 17500, loss[loss=0.1621, simple_loss=0.2401, pruned_loss=0.04211, over 16550.00 frames. ], tot_loss[loss=0.2157, simple_loss=0.2944, pruned_loss=0.06852, over 4270437.11 frames. ], batch size: 60, lr: 2.92e-03, grad_scale: 16.0 2023-06-27 07:07:24,155 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1751892.0, ans=0.125 2023-06-27 07:07:27,256 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=1751892.0, ans=0.125 2023-06-27 07:07:28,947 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1751892.0, ans=0.125 2023-06-27 07:07:59,007 INFO [train.py:996] (3/4) Epoch 10, batch 17550, loss[loss=0.1976, simple_loss=0.2963, pruned_loss=0.04948, over 21826.00 frames. ], tot_loss[loss=0.2152, simple_loss=0.2954, pruned_loss=0.06747, over 4260596.92 frames. ], batch size: 316, lr: 2.92e-03, grad_scale: 8.0 2023-06-27 07:08:01,238 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1752012.0, ans=0.1 2023-06-27 07:08:50,542 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.15 vs. limit=15.0 2023-06-27 07:09:00,457 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1752192.0, ans=0.0 2023-06-27 07:09:08,413 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.979e+02 5.513e+02 7.220e+02 1.144e+03 2.854e+03, threshold=1.444e+03, percent-clipped=10.0 2023-06-27 07:09:09,144 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1752192.0, ans=0.125 2023-06-27 07:09:48,125 INFO [train.py:996] (3/4) Epoch 10, batch 17600, loss[loss=0.2291, simple_loss=0.3093, pruned_loss=0.07441, over 21712.00 frames. ], tot_loss[loss=0.2176, simple_loss=0.2986, pruned_loss=0.06828, over 4264950.78 frames. ], batch size: 332, lr: 2.92e-03, grad_scale: 16.0 2023-06-27 07:09:48,748 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=1752312.0, ans=0.04949747468305833 2023-06-27 07:10:12,348 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.79 vs. 
limit=15.0 2023-06-27 07:10:34,675 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1752432.0, ans=0.0 2023-06-27 07:10:46,416 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=1752432.0, ans=0.0 2023-06-27 07:10:52,063 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1752432.0, ans=0.125 2023-06-27 07:11:36,215 INFO [train.py:996] (3/4) Epoch 10, batch 17650, loss[loss=0.2472, simple_loss=0.3309, pruned_loss=0.0817, over 21254.00 frames. ], tot_loss[loss=0.2159, simple_loss=0.2958, pruned_loss=0.06797, over 4250229.67 frames. ], batch size: 143, lr: 2.92e-03, grad_scale: 16.0 2023-06-27 07:11:53,948 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1752612.0, ans=0.1 2023-06-27 07:12:08,000 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1752672.0, ans=0.125 2023-06-27 07:12:30,729 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1752732.0, ans=0.125 2023-06-27 07:12:51,343 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.366e+02 6.949e+02 1.125e+03 1.736e+03 3.581e+03, threshold=2.249e+03, percent-clipped=33.0 2023-06-27 07:13:07,811 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=1752852.0, ans=0.0 2023-06-27 07:13:07,891 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1752852.0, ans=0.125 2023-06-27 07:13:30,127 INFO [train.py:996] (3/4) Epoch 10, batch 17700, loss[loss=0.2359, simple_loss=0.3204, pruned_loss=0.07574, over 21752.00 frames. ], tot_loss[loss=0.2125, simple_loss=0.2929, pruned_loss=0.06607, over 4256680.19 frames. ], batch size: 298, lr: 2.92e-03, grad_scale: 16.0 2023-06-27 07:13:30,842 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer_na.min_abs, batch_count=1752912.0, ans=0.02 2023-06-27 07:14:04,743 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=1752972.0, ans=0.125 2023-06-27 07:14:04,817 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=1752972.0, ans=0.125 2023-06-27 07:15:09,775 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1753152.0, ans=0.125 2023-06-27 07:15:11,861 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=1753152.0, ans=0.07 2023-06-27 07:15:25,777 INFO [train.py:996] (3/4) Epoch 10, batch 17750, loss[loss=0.2506, simple_loss=0.3249, pruned_loss=0.08811, over 21396.00 frames. ], tot_loss[loss=0.2186, simple_loss=0.299, pruned_loss=0.06908, over 4260752.47 frames. ], batch size: 159, lr: 2.92e-03, grad_scale: 16.0 2023-06-27 07:15:30,001 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1753212.0, ans=0.125 2023-06-27 07:15:58,964 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=8.50 vs. 
limit=15.0 2023-06-27 07:16:31,246 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.520e+02 6.307e+02 8.574e+02 1.258e+03 1.929e+03, threshold=1.715e+03, percent-clipped=0.0 2023-06-27 07:17:15,947 INFO [train.py:996] (3/4) Epoch 10, batch 17800, loss[loss=0.2248, simple_loss=0.3195, pruned_loss=0.06501, over 21306.00 frames. ], tot_loss[loss=0.2166, simple_loss=0.2973, pruned_loss=0.06794, over 4263565.64 frames. ], batch size: 549, lr: 2.92e-03, grad_scale: 16.0 2023-06-27 07:17:33,290 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.17 vs. limit=15.0 2023-06-27 07:17:34,213 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=1753512.0, ans=0.125 2023-06-27 07:18:16,589 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1753632.0, ans=0.125 2023-06-27 07:18:19,850 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.min_abs, batch_count=1753692.0, ans=0.5 2023-06-27 07:18:21,521 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=1753692.0, ans=0.0 2023-06-27 07:18:47,463 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=1753752.0, ans=0.125 2023-06-27 07:19:09,858 INFO [train.py:996] (3/4) Epoch 10, batch 17850, loss[loss=0.231, simple_loss=0.2974, pruned_loss=0.08225, over 21243.00 frames. ], tot_loss[loss=0.2182, simple_loss=0.298, pruned_loss=0.06923, over 4265013.56 frames. ], batch size: 176, lr: 2.92e-03, grad_scale: 16.0 2023-06-27 07:19:23,292 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.14 vs. limit=15.0 2023-06-27 07:20:12,722 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1753992.0, ans=0.0 2023-06-27 07:20:19,501 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.075e+02 5.767e+02 7.778e+02 1.051e+03 2.491e+03, threshold=1.556e+03, percent-clipped=2.0 2023-06-27 07:20:27,106 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=1753992.0, ans=0.0 2023-06-27 07:20:59,090 INFO [train.py:996] (3/4) Epoch 10, batch 17900, loss[loss=0.2235, simple_loss=0.2978, pruned_loss=0.07459, over 19972.00 frames. ], tot_loss[loss=0.2234, simple_loss=0.3038, pruned_loss=0.07144, over 4272833.91 frames. ], batch size: 703, lr: 2.92e-03, grad_scale: 16.0 2023-06-27 07:21:23,027 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=1754172.0, ans=0.2 2023-06-27 07:21:38,925 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=1754232.0, ans=0.125 2023-06-27 07:22:32,776 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1754352.0, ans=0.125 2023-06-27 07:22:36,576 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.45 vs. 
limit=15.0 2023-06-27 07:22:46,375 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=1754352.0, ans=0.125 2023-06-27 07:22:49,920 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1754352.0, ans=0.125 2023-06-27 07:22:54,514 INFO [train.py:996] (3/4) Epoch 10, batch 17950, loss[loss=0.2072, simple_loss=0.3062, pruned_loss=0.05416, over 21674.00 frames. ], tot_loss[loss=0.2213, simple_loss=0.3047, pruned_loss=0.06896, over 4266884.49 frames. ], batch size: 414, lr: 2.92e-03, grad_scale: 16.0 2023-06-27 07:23:44,573 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=1754532.0, ans=0.0 2023-06-27 07:23:57,653 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.559e+02 6.955e+02 1.067e+03 1.323e+03 3.422e+03, threshold=2.134e+03, percent-clipped=13.0 2023-06-27 07:24:14,438 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.23 vs. limit=15.0 2023-06-27 07:24:20,831 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=1754652.0, ans=0.2 2023-06-27 07:24:35,943 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=9.04 vs. limit=10.0 2023-06-27 07:24:41,309 INFO [train.py:996] (3/4) Epoch 10, batch 18000, loss[loss=0.2265, simple_loss=0.3361, pruned_loss=0.05848, over 20769.00 frames. ], tot_loss[loss=0.2166, simple_loss=0.2986, pruned_loss=0.06726, over 4261068.86 frames. ], batch size: 607, lr: 2.92e-03, grad_scale: 32.0 2023-06-27 07:24:41,310 INFO [train.py:1019] (3/4) Computing validation loss 2023-06-27 07:24:57,989 INFO [zipformer.py:1728] (3/4) name=encoder.encoders.4.encoder.layers.0.self_attn_weights, attn_weights_entropy = tensor([2.0198, 3.0599, 2.9099, 2.0672], device='cuda:3') 2023-06-27 07:24:59,824 INFO [train.py:1028] (3/4) Epoch 10, validation: loss=0.2583, simple_loss=0.3514, pruned_loss=0.08255, over 1796401.00 frames. 2023-06-27 07:24:59,826 INFO [train.py:1029] (3/4) Maximum memory allocated so far is 23690MB 2023-06-27 07:25:44,524 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.12 vs. limit=6.0 2023-06-27 07:26:08,385 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1754892.0, ans=0.125 2023-06-27 07:26:48,133 INFO [train.py:996] (3/4) Epoch 10, batch 18050, loss[loss=0.1794, simple_loss=0.2638, pruned_loss=0.04748, over 21707.00 frames. ], tot_loss[loss=0.2129, simple_loss=0.2927, pruned_loss=0.06653, over 4266085.12 frames. 
], batch size: 282, lr: 2.92e-03, grad_scale: 32.0 2023-06-27 07:27:08,438 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1755012.0, ans=0.1 2023-06-27 07:27:56,146 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1755192.0, ans=0.1 2023-06-27 07:28:06,130 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.888e+02 5.313e+02 7.169e+02 9.498e+02 2.481e+03, threshold=1.434e+03, percent-clipped=3.0 2023-06-27 07:28:37,176 INFO [train.py:996] (3/4) Epoch 10, batch 18100, loss[loss=0.2209, simple_loss=0.3148, pruned_loss=0.0635, over 21622.00 frames. ], tot_loss[loss=0.2165, simple_loss=0.2964, pruned_loss=0.06823, over 4265624.08 frames. ], batch size: 263, lr: 2.92e-03, grad_scale: 16.0 2023-06-27 07:28:58,133 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=13.07 vs. limit=22.5 2023-06-27 07:29:00,872 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1755312.0, ans=0.125 2023-06-27 07:29:30,358 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1755432.0, ans=0.125 2023-06-27 07:29:56,467 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=1755492.0, ans=0.0 2023-06-27 07:30:01,413 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=1755492.0, ans=0.07 2023-06-27 07:30:04,882 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=1755492.0, ans=0.125 2023-06-27 07:30:23,543 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=1755612.0, ans=0.07 2023-06-27 07:30:24,618 INFO [train.py:996] (3/4) Epoch 10, batch 18150, loss[loss=0.255, simple_loss=0.3731, pruned_loss=0.06839, over 19750.00 frames. ], tot_loss[loss=0.2166, simple_loss=0.2986, pruned_loss=0.06726, over 4269343.55 frames. ], batch size: 702, lr: 2.92e-03, grad_scale: 16.0 2023-06-27 07:30:25,938 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.06 vs. limit=22.5 2023-06-27 07:30:57,475 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1755672.0, ans=0.0 2023-06-27 07:31:29,446 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.82 vs. limit=15.0 2023-06-27 07:31:42,083 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.282e+02 6.043e+02 8.888e+02 1.339e+03 2.734e+03, threshold=1.778e+03, percent-clipped=20.0 2023-06-27 07:32:04,198 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=1755852.0, ans=0.0 2023-06-27 07:32:11,843 INFO [train.py:996] (3/4) Epoch 10, batch 18200, loss[loss=0.2118, simple_loss=0.2654, pruned_loss=0.07907, over 21333.00 frames. ], tot_loss[loss=0.2135, simple_loss=0.2926, pruned_loss=0.06725, over 4261935.83 frames. 
], batch size: 473, lr: 2.92e-03, grad_scale: 16.0 2023-06-27 07:32:38,272 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=1755972.0, ans=0.95 2023-06-27 07:32:49,365 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=1755972.0, ans=0.125 2023-06-27 07:32:54,793 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=1756032.0, ans=0.125 2023-06-27 07:32:56,462 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1756032.0, ans=0.125 2023-06-27 07:33:10,133 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=1756032.0, ans=10.0 2023-06-27 07:33:19,830 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-27 07:33:26,677 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=1756092.0, ans=0.2 2023-06-27 07:33:56,054 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=1756212.0, ans=0.2 2023-06-27 07:33:57,151 INFO [train.py:996] (3/4) Epoch 10, batch 18250, loss[loss=0.2159, simple_loss=0.2911, pruned_loss=0.07038, over 21718.00 frames. ], tot_loss[loss=0.2081, simple_loss=0.2852, pruned_loss=0.0655, over 4270118.35 frames. ], batch size: 389, lr: 2.92e-03, grad_scale: 16.0 2023-06-27 07:33:58,324 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.67 vs. limit=15.0 2023-06-27 07:34:21,087 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1756272.0, ans=0.1 2023-06-27 07:34:30,417 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=1756272.0, ans=0.125 2023-06-27 07:34:35,755 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=1756272.0, ans=0.5 2023-06-27 07:34:40,526 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-27 07:35:06,624 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.839e+02 5.360e+02 7.214e+02 1.131e+03 2.943e+03, threshold=1.443e+03, percent-clipped=6.0 2023-06-27 07:35:41,584 INFO [train.py:996] (3/4) Epoch 10, batch 18300, loss[loss=0.2033, simple_loss=0.28, pruned_loss=0.06328, over 21462.00 frames. ], tot_loss[loss=0.2078, simple_loss=0.2845, pruned_loss=0.06556, over 4278150.47 frames. ], batch size: 131, lr: 2.92e-03, grad_scale: 16.0 2023-06-27 07:36:34,992 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-27 07:36:46,971 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1756692.0, ans=0.0 2023-06-27 07:37:27,279 INFO [train.py:996] (3/4) Epoch 10, batch 18350, loss[loss=0.2036, simple_loss=0.2842, pruned_loss=0.06154, over 21468.00 frames. ], tot_loss[loss=0.2098, simple_loss=0.2898, pruned_loss=0.06487, over 4267608.65 frames. 
], batch size: 389, lr: 2.92e-03, grad_scale: 16.0 2023-06-27 07:37:39,122 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.47 vs. limit=15.0 2023-06-27 07:37:52,196 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1756872.0, ans=0.0 2023-06-27 07:38:33,374 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=9.48 vs. limit=15.0 2023-06-27 07:38:39,347 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.953e+02 5.887e+02 8.763e+02 1.316e+03 3.037e+03, threshold=1.753e+03, percent-clipped=16.0 2023-06-27 07:38:41,776 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=1756992.0, ans=0.2 2023-06-27 07:38:45,500 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1756992.0, ans=0.125 2023-06-27 07:39:05,157 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=1757052.0, ans=0.2 2023-06-27 07:39:11,933 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=1757052.0, ans=0.125 2023-06-27 07:39:16,523 INFO [train.py:996] (3/4) Epoch 10, batch 18400, loss[loss=0.1523, simple_loss=0.2326, pruned_loss=0.036, over 21358.00 frames. ], tot_loss[loss=0.2066, simple_loss=0.2854, pruned_loss=0.06393, over 4270494.29 frames. ], batch size: 131, lr: 2.92e-03, grad_scale: 32.0 2023-06-27 07:39:51,116 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1757172.0, ans=0.125 2023-06-27 07:39:56,208 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1757172.0, ans=0.125 2023-06-27 07:40:12,500 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.26 vs. limit=10.0 2023-06-27 07:40:15,057 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1757232.0, ans=0.1 2023-06-27 07:40:40,119 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.48 vs. limit=15.0 2023-06-27 07:40:51,624 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1757352.0, ans=0.125 2023-06-27 07:41:04,181 INFO [train.py:996] (3/4) Epoch 10, batch 18450, loss[loss=0.1742, simple_loss=0.2716, pruned_loss=0.03837, over 21730.00 frames. ], tot_loss[loss=0.2033, simple_loss=0.2831, pruned_loss=0.06178, over 4271982.51 frames. 
], batch size: 415, lr: 2.92e-03, grad_scale: 16.0 2023-06-27 07:42:14,668 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1757592.0, ans=0.125 2023-06-27 07:42:17,470 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.327e+02 4.825e+02 6.029e+02 8.495e+02 1.994e+03, threshold=1.206e+03, percent-clipped=1.0 2023-06-27 07:42:40,536 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=1757652.0, ans=0.2 2023-06-27 07:42:40,573 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1757652.0, ans=0.125 2023-06-27 07:42:42,566 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=6.50 vs. limit=22.5 2023-06-27 07:42:50,177 INFO [train.py:996] (3/4) Epoch 10, batch 18500, loss[loss=0.1702, simple_loss=0.2493, pruned_loss=0.0456, over 21375.00 frames. ], tot_loss[loss=0.1989, simple_loss=0.2771, pruned_loss=0.06029, over 4265129.55 frames. ], batch size: 194, lr: 2.92e-03, grad_scale: 8.0 2023-06-27 07:43:51,161 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1757832.0, ans=0.125 2023-06-27 07:44:37,160 INFO [train.py:996] (3/4) Epoch 10, batch 18550, loss[loss=0.2221, simple_loss=0.2807, pruned_loss=0.08171, over 21320.00 frames. ], tot_loss[loss=0.1977, simple_loss=0.2757, pruned_loss=0.05986, over 4264941.32 frames. ], batch size: 159, lr: 2.92e-03, grad_scale: 8.0 2023-06-27 07:45:04,029 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=1758072.0, ans=0.05 2023-06-27 07:45:05,573 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-27 07:45:57,522 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.953e+02 6.338e+02 9.700e+02 1.484e+03 3.316e+03, threshold=1.940e+03, percent-clipped=34.0 2023-06-27 07:46:24,946 INFO [train.py:996] (3/4) Epoch 10, batch 18600, loss[loss=0.1776, simple_loss=0.2512, pruned_loss=0.05204, over 21204.00 frames. ], tot_loss[loss=0.1962, simple_loss=0.2731, pruned_loss=0.05962, over 4270252.27 frames. ], batch size: 144, lr: 2.92e-03, grad_scale: 8.0 2023-06-27 07:46:27,952 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.54 vs. limit=22.5 2023-06-27 07:47:24,180 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-27 07:47:37,650 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1758492.0, ans=0.125 2023-06-27 07:48:09,077 INFO [train.py:996] (3/4) Epoch 10, batch 18650, loss[loss=0.2192, simple_loss=0.3096, pruned_loss=0.06443, over 21754.00 frames. ], tot_loss[loss=0.1968, simple_loss=0.2735, pruned_loss=0.06006, over 4266434.38 frames. 
], batch size: 415, lr: 2.92e-03, grad_scale: 8.0 2023-06-27 07:48:11,507 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=1758612.0, ans=0.5 2023-06-27 07:48:29,539 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=1758672.0, ans=0.05 2023-06-27 07:48:36,544 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1758672.0, ans=0.125 2023-06-27 07:48:42,926 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=1758672.0, ans=0.2 2023-06-27 07:49:21,076 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.715e+02 5.496e+02 8.127e+02 1.461e+03 3.115e+03, threshold=1.625e+03, percent-clipped=10.0 2023-06-27 07:49:22,185 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.39 vs. limit=12.0 2023-06-27 07:49:33,770 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.76 vs. limit=15.0 2023-06-27 07:49:38,669 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-27 07:49:40,298 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1758852.0, ans=0.125 2023-06-27 07:49:53,335 INFO [train.py:996] (3/4) Epoch 10, batch 18700, loss[loss=0.2171, simple_loss=0.2952, pruned_loss=0.06947, over 22038.00 frames. ], tot_loss[loss=0.1978, simple_loss=0.2723, pruned_loss=0.06166, over 4265041.72 frames. ], batch size: 113, lr: 2.92e-03, grad_scale: 8.0 2023-06-27 07:49:55,536 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-27 07:49:57,781 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.66 vs. limit=22.5 2023-06-27 07:50:17,911 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=1758972.0, ans=0.2 2023-06-27 07:50:25,106 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-27 07:50:40,333 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=1759032.0, ans=0.125 2023-06-27 07:51:05,956 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=1759092.0, ans=0.125 2023-06-27 07:51:39,683 INFO [train.py:996] (3/4) Epoch 10, batch 18750, loss[loss=0.2361, simple_loss=0.3158, pruned_loss=0.07823, over 21193.00 frames. ], tot_loss[loss=0.2006, simple_loss=0.2742, pruned_loss=0.06349, over 4266357.86 frames. ], batch size: 143, lr: 2.92e-03, grad_scale: 8.0 2023-06-27 07:51:53,155 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=12.00 vs. 
limit=22.5 2023-06-27 07:52:16,332 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1759272.0, ans=0.125 2023-06-27 07:52:18,135 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1759272.0, ans=0.125 2023-06-27 07:52:26,655 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1759332.0, ans=0.0 2023-06-27 07:52:28,960 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.70 vs. limit=15.0 2023-06-27 07:52:31,781 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1759332.0, ans=0.125 2023-06-27 07:52:35,438 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.79 vs. limit=22.5 2023-06-27 07:52:37,247 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=11.54 vs. limit=15.0 2023-06-27 07:52:52,811 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.039e+02 6.322e+02 1.036e+03 1.574e+03 2.810e+03, threshold=2.072e+03, percent-clipped=23.0 2023-06-27 07:53:25,196 INFO [train.py:996] (3/4) Epoch 10, batch 18800, loss[loss=0.1741, simple_loss=0.2552, pruned_loss=0.04648, over 21469.00 frames. ], tot_loss[loss=0.204, simple_loss=0.2804, pruned_loss=0.06378, over 4269824.40 frames. ], batch size: 194, lr: 2.92e-03, grad_scale: 16.0 2023-06-27 07:54:09,901 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=1759632.0, ans=0.0 2023-06-27 07:54:09,976 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=1759632.0, ans=0.0 2023-06-27 07:54:10,680 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=5.64 vs. limit=10.0 2023-06-27 07:54:14,782 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-27 07:55:10,078 INFO [train.py:996] (3/4) Epoch 10, batch 18850, loss[loss=0.1976, simple_loss=0.2712, pruned_loss=0.06199, over 21692.00 frames. ], tot_loss[loss=0.1985, simple_loss=0.2771, pruned_loss=0.05995, over 4267757.18 frames. ], batch size: 333, lr: 2.92e-03, grad_scale: 16.0 2023-06-27 07:55:26,076 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=1759872.0, ans=0.2 2023-06-27 07:56:23,522 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.478e+02 5.405e+02 6.936e+02 9.507e+02 2.005e+03, threshold=1.387e+03, percent-clipped=0.0 2023-06-27 07:56:26,377 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=8.91 vs. limit=15.0 2023-06-27 07:56:28,279 INFO [scaling.py:962] (3/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=5.29 vs. 
limit=8.0 2023-06-27 07:56:34,219 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1759992.0, ans=0.125 2023-06-27 07:56:56,190 INFO [train.py:996] (3/4) Epoch 10, batch 18900, loss[loss=0.2112, simple_loss=0.2771, pruned_loss=0.07261, over 21778.00 frames. ], tot_loss[loss=0.1971, simple_loss=0.2736, pruned_loss=0.06033, over 4269934.21 frames. ], batch size: 316, lr: 2.92e-03, grad_scale: 16.0 2023-06-27 07:57:30,948 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1760172.0, ans=0.125 2023-06-27 07:57:38,797 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1760232.0, ans=0.125 2023-06-27 07:58:42,082 INFO [train.py:996] (3/4) Epoch 10, batch 18950, loss[loss=0.2223, simple_loss=0.3266, pruned_loss=0.05903, over 21280.00 frames. ], tot_loss[loss=0.2007, simple_loss=0.2768, pruned_loss=0.06228, over 4267433.70 frames. ], batch size: 548, lr: 2.92e-03, grad_scale: 16.0 2023-06-27 07:59:16,899 INFO [scaling.py:962] (3/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=3.75 vs. limit=5.0 2023-06-27 07:59:25,139 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1760532.0, ans=0.125 2023-06-27 07:59:49,544 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1760592.0, ans=0.0 2023-06-27 07:59:57,328 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.836e+02 7.333e+02 1.084e+03 1.694e+03 3.772e+03, threshold=2.167e+03, percent-clipped=36.0 2023-06-27 08:00:06,502 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1760652.0, ans=0.125 2023-06-27 08:00:08,212 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1760652.0, ans=0.1 2023-06-27 08:00:24,665 INFO [train.py:996] (3/4) Epoch 10, batch 19000, loss[loss=0.2264, simple_loss=0.3139, pruned_loss=0.0694, over 21510.00 frames. ], tot_loss[loss=0.2095, simple_loss=0.2877, pruned_loss=0.06561, over 4275088.49 frames. ], batch size: 131, lr: 2.92e-03, grad_scale: 16.0 2023-06-27 08:00:48,347 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=1760772.0, ans=0.035 2023-06-27 08:01:15,708 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.30 vs. limit=22.5 2023-06-27 08:02:06,191 INFO [train.py:996] (3/4) Epoch 10, batch 19050, loss[loss=0.2246, simple_loss=0.2918, pruned_loss=0.07867, over 21817.00 frames. ], tot_loss[loss=0.2145, simple_loss=0.2912, pruned_loss=0.06887, over 4275200.15 frames. ], batch size: 282, lr: 2.92e-03, grad_scale: 16.0 2023-06-27 08:02:29,955 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.42 vs. limit=15.0 2023-06-27 08:02:35,026 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.03 vs. 
limit=15.0 2023-06-27 08:02:53,382 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=1761072.0, ans=0.05 2023-06-27 08:02:54,795 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1761132.0, ans=0.125 2023-06-27 08:03:11,901 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1761132.0, ans=0.125 2023-06-27 08:03:18,245 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=1761192.0, ans=0.05 2023-06-27 08:03:24,545 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.013e+02 5.908e+02 6.994e+02 9.504e+02 2.053e+03, threshold=1.399e+03, percent-clipped=0.0 2023-06-27 08:03:25,345 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1761192.0, ans=0.1 2023-06-27 08:03:52,608 INFO [train.py:996] (3/4) Epoch 10, batch 19100, loss[loss=0.186, simple_loss=0.2497, pruned_loss=0.06109, over 21319.00 frames. ], tot_loss[loss=0.2139, simple_loss=0.2891, pruned_loss=0.06934, over 4274914.70 frames. ], batch size: 194, lr: 2.92e-03, grad_scale: 16.0 2023-06-27 08:03:54,792 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1761312.0, ans=0.0 2023-06-27 08:03:58,087 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1761312.0, ans=0.125 2023-06-27 08:04:11,035 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1761312.0, ans=0.125 2023-06-27 08:04:33,602 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=1761372.0, ans=0.5 2023-06-27 08:04:56,662 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer_ff2.min_abs, batch_count=1761432.0, ans=0.1 2023-06-27 08:05:16,478 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.61 vs. limit=15.0 2023-06-27 08:05:42,417 INFO [train.py:996] (3/4) Epoch 10, batch 19150, loss[loss=0.2219, simple_loss=0.3197, pruned_loss=0.06202, over 21704.00 frames. ], tot_loss[loss=0.2144, simple_loss=0.2908, pruned_loss=0.069, over 4265350.94 frames. ], batch size: 298, lr: 2.92e-03, grad_scale: 16.0 2023-06-27 08:06:53,112 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.207e+02 6.138e+02 1.014e+03 1.599e+03 3.928e+03, threshold=2.029e+03, percent-clipped=32.0 2023-06-27 08:07:18,189 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1761852.0, ans=0.125 2023-06-27 08:07:26,282 INFO [train.py:996] (3/4) Epoch 10, batch 19200, loss[loss=0.2165, simple_loss=0.3207, pruned_loss=0.05617, over 21573.00 frames. ], tot_loss[loss=0.219, simple_loss=0.3002, pruned_loss=0.06893, over 4260752.27 frames. 
], batch size: 263, lr: 2.92e-03, grad_scale: 32.0 2023-06-27 08:08:00,033 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1761972.0, ans=0.0 2023-06-27 08:08:20,497 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1762032.0, ans=0.125 2023-06-27 08:08:30,392 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1762092.0, ans=0.1 2023-06-27 08:08:44,116 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=1762152.0, ans=0.125 2023-06-27 08:09:12,904 INFO [train.py:996] (3/4) Epoch 10, batch 19250, loss[loss=0.1735, simple_loss=0.2694, pruned_loss=0.03884, over 21787.00 frames. ], tot_loss[loss=0.2157, simple_loss=0.3007, pruned_loss=0.06537, over 4262050.29 frames. ], batch size: 298, lr: 2.92e-03, grad_scale: 16.0 2023-06-27 08:09:32,090 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=1762212.0, ans=0.125 2023-06-27 08:10:03,311 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=1762332.0, ans=0.0 2023-06-27 08:10:07,044 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-27 08:10:23,403 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.711e+02 5.181e+02 6.655e+02 8.936e+02 1.845e+03, threshold=1.331e+03, percent-clipped=0.0 2023-06-27 08:10:59,773 INFO [train.py:996] (3/4) Epoch 10, batch 19300, loss[loss=0.2149, simple_loss=0.2898, pruned_loss=0.07002, over 21864.00 frames. ], tot_loss[loss=0.2138, simple_loss=0.2974, pruned_loss=0.06505, over 4256097.57 frames. ], batch size: 414, lr: 2.91e-03, grad_scale: 16.0 2023-06-27 08:11:00,526 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=1762512.0, ans=0.2 2023-06-27 08:11:30,375 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=1762572.0, ans=0.125 2023-06-27 08:11:56,552 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1762632.0, ans=0.125 2023-06-27 08:11:59,898 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=1762632.0, ans=0.125 2023-06-27 08:12:03,564 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=1762692.0, ans=0.125 2023-06-27 08:12:09,229 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.94 vs. limit=15.0 2023-06-27 08:12:21,371 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=1762692.0, ans=0.125 2023-06-27 08:12:52,825 INFO [train.py:996] (3/4) Epoch 10, batch 19350, loss[loss=0.1659, simple_loss=0.2518, pruned_loss=0.04002, over 21544.00 frames. ], tot_loss[loss=0.2079, simple_loss=0.2911, pruned_loss=0.06232, over 4261405.10 frames. 
], batch size: 195, lr: 2.91e-03, grad_scale: 16.0 2023-06-27 08:13:10,704 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1762812.0, ans=0.1 2023-06-27 08:13:37,775 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1762932.0, ans=0.125 2023-06-27 08:13:50,240 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=1762992.0, ans=0.0 2023-06-27 08:14:03,615 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.583e+02 5.604e+02 8.480e+02 1.112e+03 2.601e+03, threshold=1.696e+03, percent-clipped=20.0 2023-06-27 08:14:11,163 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=1763052.0, ans=0.125 2023-06-27 08:14:19,540 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1763052.0, ans=0.1 2023-06-27 08:14:20,016 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.85 vs. limit=22.5 2023-06-27 08:14:39,174 INFO [train.py:996] (3/4) Epoch 10, batch 19400, loss[loss=0.1749, simple_loss=0.2609, pruned_loss=0.04444, over 21797.00 frames. ], tot_loss[loss=0.206, simple_loss=0.2891, pruned_loss=0.06148, over 4264262.32 frames. ], batch size: 282, lr: 2.91e-03, grad_scale: 16.0 2023-06-27 08:14:48,609 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1763112.0, ans=0.125 2023-06-27 08:16:22,057 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=1763412.0, ans=0.125 2023-06-27 08:16:23,172 INFO [train.py:996] (3/4) Epoch 10, batch 19450, loss[loss=0.2012, simple_loss=0.2638, pruned_loss=0.06925, over 21639.00 frames. ], tot_loss[loss=0.2065, simple_loss=0.2872, pruned_loss=0.06294, over 4276654.80 frames. ], batch size: 247, lr: 2.91e-03, grad_scale: 16.0 2023-06-27 08:16:39,868 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.49 vs. limit=15.0 2023-06-27 08:16:43,793 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.whiten.whitening_limit, batch_count=1763412.0, ans=12.0 2023-06-27 08:17:17,987 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1763532.0, ans=0.1 2023-06-27 08:17:19,486 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1763592.0, ans=0.1 2023-06-27 08:17:21,894 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=9.05 vs. limit=15.0 2023-06-27 08:17:34,585 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.998e+02 5.344e+02 8.011e+02 1.240e+03 3.010e+03, threshold=1.602e+03, percent-clipped=14.0 2023-06-27 08:17:51,694 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.77 vs. 
limit=15.0 2023-06-27 08:18:11,426 INFO [train.py:996] (3/4) Epoch 10, batch 19500, loss[loss=0.1864, simple_loss=0.2495, pruned_loss=0.06167, over 21813.00 frames. ], tot_loss[loss=0.2067, simple_loss=0.2848, pruned_loss=0.06433, over 4274952.56 frames. ], batch size: 102, lr: 2.91e-03, grad_scale: 16.0 2023-06-27 08:19:03,313 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1763892.0, ans=0.1 2023-06-27 08:19:57,004 INFO [train.py:996] (3/4) Epoch 10, batch 19550, loss[loss=0.1885, simple_loss=0.283, pruned_loss=0.04697, over 21464.00 frames. ], tot_loss[loss=0.2035, simple_loss=0.2808, pruned_loss=0.06308, over 4279576.46 frames. ], batch size: 211, lr: 2.91e-03, grad_scale: 16.0 2023-06-27 08:20:53,126 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=1764192.0, ans=0.2 2023-06-27 08:21:01,404 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.262e+02 6.617e+02 9.937e+02 1.346e+03 3.535e+03, threshold=1.987e+03, percent-clipped=18.0 2023-06-27 08:21:07,284 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1764192.0, ans=0.125 2023-06-27 08:21:10,619 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1764252.0, ans=0.0 2023-06-27 08:21:41,960 INFO [train.py:996] (3/4) Epoch 10, batch 19600, loss[loss=0.2719, simple_loss=0.3279, pruned_loss=0.108, over 21563.00 frames. ], tot_loss[loss=0.2055, simple_loss=0.283, pruned_loss=0.064, over 4287682.71 frames. ], batch size: 471, lr: 2.91e-03, grad_scale: 32.0 2023-06-27 08:21:53,819 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.70 vs. limit=15.0 2023-06-27 08:22:07,163 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1764372.0, ans=0.0 2023-06-27 08:23:30,476 INFO [train.py:996] (3/4) Epoch 10, batch 19650, loss[loss=0.2108, simple_loss=0.2988, pruned_loss=0.06143, over 21419.00 frames. ], tot_loss[loss=0.2101, simple_loss=0.286, pruned_loss=0.06708, over 4289387.57 frames. ], batch size: 131, lr: 2.91e-03, grad_scale: 32.0 2023-06-27 08:23:50,864 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.53 vs. limit=12.0 2023-06-27 08:23:52,196 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1764672.0, ans=0.125 2023-06-27 08:24:35,279 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1764792.0, ans=0.1 2023-06-27 08:24:56,585 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.384e+02 6.290e+02 8.079e+02 1.063e+03 2.506e+03, threshold=1.616e+03, percent-clipped=1.0 2023-06-27 08:25:03,214 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=1764852.0, ans=0.2 2023-06-27 08:25:22,630 INFO [train.py:996] (3/4) Epoch 10, batch 19700, loss[loss=0.2059, simple_loss=0.3058, pruned_loss=0.05296, over 20841.00 frames. ], tot_loss[loss=0.213, simple_loss=0.29, pruned_loss=0.06803, over 4283907.98 frames. 
], batch size: 609, lr: 2.91e-03, grad_scale: 16.0 2023-06-27 08:25:25,205 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=1764912.0, ans=0.09899494936611666 2023-06-27 08:25:28,624 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1764912.0, ans=0.125 2023-06-27 08:25:45,604 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.20 vs. limit=15.0 2023-06-27 08:25:58,936 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.89 vs. limit=22.5 2023-06-27 08:26:10,037 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=1765032.0, ans=10.0 2023-06-27 08:26:41,556 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1765092.0, ans=0.125 2023-06-27 08:27:12,091 INFO [train.py:996] (3/4) Epoch 10, batch 19750, loss[loss=0.3104, simple_loss=0.4028, pruned_loss=0.109, over 21497.00 frames. ], tot_loss[loss=0.2185, simple_loss=0.2986, pruned_loss=0.06917, over 4274632.35 frames. ], batch size: 471, lr: 2.91e-03, grad_scale: 16.0 2023-06-27 08:28:07,755 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1765332.0, ans=0.0 2023-06-27 08:28:10,856 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1765332.0, ans=0.125 2023-06-27 08:28:19,275 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1765332.0, ans=0.1 2023-06-27 08:28:33,897 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.148e+02 7.081e+02 1.435e+03 2.263e+03 4.438e+03, threshold=2.870e+03, percent-clipped=43.0 2023-06-27 08:28:58,220 INFO [train.py:996] (3/4) Epoch 10, batch 19800, loss[loss=0.1798, simple_loss=0.2694, pruned_loss=0.0451, over 21788.00 frames. ], tot_loss[loss=0.2182, simple_loss=0.2981, pruned_loss=0.06917, over 4271787.24 frames. ], batch size: 351, lr: 2.91e-03, grad_scale: 16.0 2023-06-27 08:30:48,427 INFO [train.py:996] (3/4) Epoch 10, batch 19850, loss[loss=0.2029, simple_loss=0.3008, pruned_loss=0.05254, over 21716.00 frames. ], tot_loss[loss=0.2104, simple_loss=0.2904, pruned_loss=0.06522, over 4271631.36 frames. ], batch size: 298, lr: 2.91e-03, grad_scale: 8.0 2023-06-27 08:31:37,582 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.11 vs. limit=12.0 2023-06-27 08:32:13,177 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.649e+02 5.631e+02 8.956e+02 1.493e+03 4.041e+03, threshold=1.791e+03, percent-clipped=3.0 2023-06-27 08:32:35,721 INFO [train.py:996] (3/4) Epoch 10, batch 19900, loss[loss=0.1924, simple_loss=0.2922, pruned_loss=0.04626, over 21597.00 frames. ], tot_loss[loss=0.2078, simple_loss=0.2896, pruned_loss=0.06302, over 4262682.15 frames. ], batch size: 263, lr: 2.91e-03, grad_scale: 8.0 2023-06-27 08:33:24,349 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.16 vs. 
limit=6.0 2023-06-27 08:34:14,397 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=1766352.0, ans=0.0 2023-06-27 08:34:17,945 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1766352.0, ans=0.125 2023-06-27 08:34:29,562 INFO [train.py:996] (3/4) Epoch 10, batch 19950, loss[loss=0.1786, simple_loss=0.2537, pruned_loss=0.05175, over 21757.00 frames. ], tot_loss[loss=0.2041, simple_loss=0.284, pruned_loss=0.06213, over 4257146.30 frames. ], batch size: 351, lr: 2.91e-03, grad_scale: 8.0 2023-06-27 08:34:51,359 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1766412.0, ans=0.125 2023-06-27 08:35:17,024 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1766472.0, ans=0.0 2023-06-27 08:35:49,401 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.611e+02 4.941e+02 6.532e+02 1.016e+03 1.667e+03, threshold=1.306e+03, percent-clipped=0.0 2023-06-27 08:36:02,111 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=1766652.0, ans=0.0 2023-06-27 08:36:02,563 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.69 vs. limit=10.0 2023-06-27 08:36:21,396 INFO [train.py:996] (3/4) Epoch 10, batch 20000, loss[loss=0.1952, simple_loss=0.2733, pruned_loss=0.05857, over 21324.00 frames. ], tot_loss[loss=0.2052, simple_loss=0.2855, pruned_loss=0.06245, over 4260599.41 frames. ], batch size: 176, lr: 2.91e-03, grad_scale: 16.0 2023-06-27 08:36:21,916 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1766712.0, ans=0.125 2023-06-27 08:36:45,607 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=1766712.0, ans=0.05 2023-06-27 08:36:50,180 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=1766772.0, ans=0.125 2023-06-27 08:37:13,373 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1766832.0, ans=0.125 2023-06-27 08:37:13,411 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1766832.0, ans=0.1 2023-06-27 08:38:03,071 INFO [train.py:996] (3/4) Epoch 10, batch 20050, loss[loss=0.1868, simple_loss=0.2352, pruned_loss=0.06917, over 20273.00 frames. ], tot_loss[loss=0.2082, simple_loss=0.2872, pruned_loss=0.06461, over 4273890.66 frames. 
], batch size: 703, lr: 2.91e-03, grad_scale: 16.0 2023-06-27 08:38:27,308 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=1767012.0, ans=0.2 2023-06-27 08:38:29,102 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1767072.0, ans=0.125 2023-06-27 08:38:46,561 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=1767132.0, ans=0.2 2023-06-27 08:39:11,936 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1767192.0, ans=0.1 2023-06-27 08:39:17,104 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1767192.0, ans=0.0 2023-06-27 08:39:18,079 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.352e+02 5.758e+02 8.028e+02 1.108e+03 2.385e+03, threshold=1.606e+03, percent-clipped=14.0 2023-06-27 08:39:56,997 INFO [train.py:996] (3/4) Epoch 10, batch 20100, loss[loss=0.2352, simple_loss=0.3324, pruned_loss=0.06895, over 21846.00 frames. ], tot_loss[loss=0.2111, simple_loss=0.2898, pruned_loss=0.06624, over 4278695.10 frames. ], batch size: 316, lr: 2.91e-03, grad_scale: 16.0 2023-06-27 08:41:26,520 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.76 vs. limit=22.5 2023-06-27 08:41:43,607 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=2.84 vs. limit=15.0 2023-06-27 08:41:44,165 INFO [train.py:996] (3/4) Epoch 10, batch 20150, loss[loss=0.238, simple_loss=0.3165, pruned_loss=0.07979, over 21465.00 frames. ], tot_loss[loss=0.2166, simple_loss=0.2956, pruned_loss=0.06876, over 4273118.71 frames. ], batch size: 211, lr: 2.91e-03, grad_scale: 16.0 2023-06-27 08:41:59,777 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1767612.0, ans=0.125 2023-06-27 08:42:05,206 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=1767672.0, ans=0.05 2023-06-27 08:43:08,915 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.695e+02 8.899e+02 1.364e+03 1.871e+03 4.503e+03, threshold=2.728e+03, percent-clipped=36.0 2023-06-27 08:43:14,841 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=1767852.0, ans=0.09899494936611666 2023-06-27 08:43:23,636 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-27 08:43:31,398 INFO [train.py:996] (3/4) Epoch 10, batch 20200, loss[loss=0.21, simple_loss=0.2897, pruned_loss=0.06516, over 21699.00 frames. ], tot_loss[loss=0.2224, simple_loss=0.3027, pruned_loss=0.07099, over 4271973.66 frames. 
], batch size: 247, lr: 2.91e-03, grad_scale: 16.0 2023-06-27 08:43:46,517 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=1767912.0, ans=0.125 2023-06-27 08:43:52,188 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1767972.0, ans=0.1 2023-06-27 08:44:36,402 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1768032.0, ans=0.125 2023-06-27 08:44:40,481 INFO [scaling.py:962] (3/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=5.19 vs. limit=8.0 2023-06-27 08:44:43,503 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.20 vs. limit=15.0 2023-06-27 08:45:18,243 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=1768212.0, ans=0.0 2023-06-27 08:45:19,234 INFO [train.py:996] (3/4) Epoch 10, batch 20250, loss[loss=0.2034, simple_loss=0.2845, pruned_loss=0.06115, over 21863.00 frames. ], tot_loss[loss=0.221, simple_loss=0.3029, pruned_loss=0.06957, over 4276436.35 frames. ], batch size: 124, lr: 2.91e-03, grad_scale: 16.0 2023-06-27 08:46:00,167 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.41 vs. limit=15.0 2023-06-27 08:46:35,269 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1768392.0, ans=0.1 2023-06-27 08:46:37,989 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.153e+02 5.975e+02 7.847e+02 1.054e+03 2.189e+03, threshold=1.569e+03, percent-clipped=0.0 2023-06-27 08:46:56,055 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.05 vs. limit=15.0 2023-06-27 08:46:59,739 INFO [train.py:996] (3/4) Epoch 10, batch 20300, loss[loss=0.2047, simple_loss=0.2912, pruned_loss=0.05908, over 21450.00 frames. ], tot_loss[loss=0.2196, simple_loss=0.3031, pruned_loss=0.06802, over 4260405.11 frames. ], batch size: 211, lr: 2.91e-03, grad_scale: 16.0 2023-06-27 08:47:04,287 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=6.81 vs. limit=15.0 2023-06-27 08:47:20,237 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=1768572.0, ans=0.125 2023-06-27 08:47:20,321 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=1768572.0, ans=0.125 2023-06-27 08:48:40,506 INFO [train.py:996] (3/4) Epoch 10, batch 20350, loss[loss=0.185, simple_loss=0.2632, pruned_loss=0.05334, over 15978.00 frames. ], tot_loss[loss=0.2198, simple_loss=0.3028, pruned_loss=0.06844, over 4252668.73 frames. ], batch size: 60, lr: 2.91e-03, grad_scale: 16.0 2023-06-27 08:48:53,280 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1768812.0, ans=0.1 2023-06-27 08:48:57,352 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=8.83 vs. 
limit=15.0 2023-06-27 08:50:07,281 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.767e+02 5.725e+02 9.452e+02 1.415e+03 2.531e+03, threshold=1.890e+03, percent-clipped=19.0 2023-06-27 08:50:23,424 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=1769052.0, ans=0.125 2023-06-27 08:50:29,307 INFO [train.py:996] (3/4) Epoch 10, batch 20400, loss[loss=0.2485, simple_loss=0.3283, pruned_loss=0.08432, over 21400.00 frames. ], tot_loss[loss=0.2233, simple_loss=0.305, pruned_loss=0.07083, over 4260752.26 frames. ], batch size: 131, lr: 2.91e-03, grad_scale: 32.0 2023-06-27 08:50:51,056 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=1769172.0, ans=0.0 2023-06-27 08:51:22,112 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.44 vs. limit=22.5 2023-06-27 08:51:55,035 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1769292.0, ans=0.125 2023-06-27 08:51:55,057 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1769292.0, ans=0.125 2023-06-27 08:52:16,215 INFO [train.py:996] (3/4) Epoch 10, batch 20450, loss[loss=0.218, simple_loss=0.2929, pruned_loss=0.07158, over 21850.00 frames. ], tot_loss[loss=0.2279, simple_loss=0.3077, pruned_loss=0.07403, over 4260913.64 frames. ], batch size: 332, lr: 2.91e-03, grad_scale: 16.0 2023-06-27 08:52:20,843 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=3.74 vs. limit=12.0 2023-06-27 08:53:27,432 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=1769592.0, ans=0.05 2023-06-27 08:53:33,165 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=9.24 vs. limit=15.0 2023-06-27 08:53:41,484 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=11.14 vs. limit=15.0 2023-06-27 08:53:42,094 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.405e+02 6.127e+02 7.186e+02 1.014e+03 1.873e+03, threshold=1.437e+03, percent-clipped=1.0 2023-06-27 08:53:45,220 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.11 vs. limit=15.0 2023-06-27 08:54:02,052 INFO [train.py:996] (3/4) Epoch 10, batch 20500, loss[loss=0.1986, simple_loss=0.2668, pruned_loss=0.06523, over 21808.00 frames. ], tot_loss[loss=0.2256, simple_loss=0.3034, pruned_loss=0.07394, over 4264777.60 frames. ], batch size: 107, lr: 2.91e-03, grad_scale: 16.0 2023-06-27 08:54:07,416 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1769712.0, ans=0.1 2023-06-27 08:54:15,030 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=5.02 vs. 
limit=15.0 2023-06-27 08:54:22,940 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=1769772.0, ans=0.0 2023-06-27 08:55:48,844 INFO [train.py:996] (3/4) Epoch 10, batch 20550, loss[loss=0.1916, simple_loss=0.2708, pruned_loss=0.05616, over 21315.00 frames. ], tot_loss[loss=0.219, simple_loss=0.2948, pruned_loss=0.07161, over 4260109.45 frames. ], batch size: 211, lr: 2.91e-03, grad_scale: 16.0 2023-06-27 08:56:06,769 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=1770072.0, ans=0.0 2023-06-27 08:56:08,543 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1770072.0, ans=0.1 2023-06-27 08:56:58,469 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=17.00 vs. limit=22.5 2023-06-27 08:57:14,706 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.809e+02 5.326e+02 8.786e+02 1.599e+03 3.543e+03, threshold=1.757e+03, percent-clipped=26.0 2023-06-27 08:57:33,869 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1770312.0, ans=0.125 2023-06-27 08:57:34,949 INFO [train.py:996] (3/4) Epoch 10, batch 20600, loss[loss=0.2172, simple_loss=0.3351, pruned_loss=0.0496, over 20040.00 frames. ], tot_loss[loss=0.2197, simple_loss=0.297, pruned_loss=0.07117, over 4254013.32 frames. ], batch size: 703, lr: 2.91e-03, grad_scale: 16.0 2023-06-27 08:57:51,585 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=16.19 vs. limit=22.5 2023-06-27 08:58:11,402 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=1770372.0, ans=0.125 2023-06-27 08:58:27,183 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.06 vs. limit=22.5 2023-06-27 08:58:40,019 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.27 vs. limit=15.0 2023-06-27 08:58:53,700 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.64 vs. limit=22.5 2023-06-27 08:58:54,889 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=1770492.0, ans=0.2 2023-06-27 08:59:19,935 INFO [train.py:996] (3/4) Epoch 10, batch 20650, loss[loss=0.2458, simple_loss=0.341, pruned_loss=0.07527, over 17040.00 frames. ], tot_loss[loss=0.2173, simple_loss=0.293, pruned_loss=0.07077, over 4245280.94 frames. ], batch size: 60, lr: 2.91e-03, grad_scale: 16.0 2023-06-27 08:59:36,759 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.23 vs. 
limit=22.5 2023-06-27 08:59:37,896 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1770672.0, ans=0.1 2023-06-27 08:59:47,586 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=1770672.0, ans=0.0 2023-06-27 09:00:22,889 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=4.79 vs. limit=15.0 2023-06-27 09:00:45,099 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.379e+02 5.610e+02 8.426e+02 1.372e+03 2.943e+03, threshold=1.685e+03, percent-clipped=16.0 2023-06-27 09:01:06,344 INFO [train.py:996] (3/4) Epoch 10, batch 20700, loss[loss=0.2494, simple_loss=0.3254, pruned_loss=0.08671, over 21400.00 frames. ], tot_loss[loss=0.2118, simple_loss=0.2875, pruned_loss=0.06809, over 4251234.84 frames. ], batch size: 507, lr: 2.91e-03, grad_scale: 16.0 2023-06-27 09:01:44,787 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1770972.0, ans=0.0 2023-06-27 09:01:48,635 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.28 vs. limit=15.0 2023-06-27 09:01:51,652 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-27 09:02:15,125 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.82 vs. limit=15.0 2023-06-27 09:02:31,370 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.60 vs. limit=15.0 2023-06-27 09:02:51,358 INFO [train.py:996] (3/4) Epoch 10, batch 20750, loss[loss=0.2358, simple_loss=0.3363, pruned_loss=0.06767, over 21756.00 frames. ], tot_loss[loss=0.2127, simple_loss=0.2896, pruned_loss=0.06787, over 4247393.17 frames. ], batch size: 351, lr: 2.91e-03, grad_scale: 16.0 2023-06-27 09:04:13,385 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.912e+02 7.283e+02 1.259e+03 1.890e+03 5.387e+03, threshold=2.519e+03, percent-clipped=32.0 2023-06-27 09:04:19,156 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=1771452.0, ans=0.0 2023-06-27 09:04:39,262 INFO [train.py:996] (3/4) Epoch 10, batch 20800, loss[loss=0.1903, simple_loss=0.2699, pruned_loss=0.05537, over 21623.00 frames. ], tot_loss[loss=0.2142, simple_loss=0.2926, pruned_loss=0.0679, over 4244556.70 frames. 
], batch size: 298, lr: 2.91e-03, grad_scale: 32.0 2023-06-27 09:04:41,786 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1771512.0, ans=0.125 2023-06-27 09:04:57,959 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1771512.0, ans=0.0 2023-06-27 09:05:13,468 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1771572.0, ans=0.125 2023-06-27 09:05:23,333 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1771572.0, ans=0.0 2023-06-27 09:05:43,964 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1771632.0, ans=0.125 2023-06-27 09:06:19,395 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1771812.0, ans=0.1 2023-06-27 09:06:20,404 INFO [train.py:996] (3/4) Epoch 10, batch 20850, loss[loss=0.2317, simple_loss=0.2976, pruned_loss=0.08288, over 21715.00 frames. ], tot_loss[loss=0.2098, simple_loss=0.2872, pruned_loss=0.06624, over 4244892.63 frames. ], batch size: 441, lr: 2.91e-03, grad_scale: 16.0 2023-06-27 09:07:16,104 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1771932.0, ans=0.125 2023-06-27 09:07:48,084 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.047e+02 6.784e+02 1.035e+03 1.709e+03 3.199e+03, threshold=2.070e+03, percent-clipped=7.0 2023-06-27 09:08:12,499 INFO [train.py:996] (3/4) Epoch 10, batch 20900, loss[loss=0.213, simple_loss=0.2813, pruned_loss=0.0723, over 21812.00 frames. ], tot_loss[loss=0.2092, simple_loss=0.2862, pruned_loss=0.0661, over 4255449.56 frames. ], batch size: 298, lr: 2.91e-03, grad_scale: 16.0 2023-06-27 09:09:15,857 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1772232.0, ans=0.125 2023-06-27 09:09:51,851 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1772352.0, ans=0.0 2023-06-27 09:09:54,329 INFO [train.py:996] (3/4) Epoch 10, batch 20950, loss[loss=0.1645, simple_loss=0.2569, pruned_loss=0.03601, over 21771.00 frames. ], tot_loss[loss=0.2039, simple_loss=0.2812, pruned_loss=0.06327, over 4257441.28 frames. ], batch size: 332, lr: 2.91e-03, grad_scale: 16.0 2023-06-27 09:10:25,028 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=1772472.0, ans=0.0 2023-06-27 09:11:19,709 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.843e+02 5.854e+02 8.072e+02 1.179e+03 2.171e+03, threshold=1.614e+03, percent-clipped=1.0 2023-06-27 09:11:38,283 INFO [train.py:996] (3/4) Epoch 10, batch 21000, loss[loss=0.2348, simple_loss=0.3371, pruned_loss=0.06624, over 19799.00 frames. ], tot_loss[loss=0.2026, simple_loss=0.2793, pruned_loss=0.06296, over 4252029.90 frames. ], batch size: 703, lr: 2.91e-03, grad_scale: 16.0 2023-06-27 09:11:38,284 INFO [train.py:1019] (3/4) Computing validation loss 2023-06-27 09:12:02,877 INFO [train.py:1028] (3/4) Epoch 10, validation: loss=0.2606, simple_loss=0.3545, pruned_loss=0.08334, over 1796401.00 frames. 
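The loss[...]/tot_loss[...] pairs throughout this log, and the validation entry just above (loss=0.2606 averaged over 1796401 frames), read as frame-weighted averages: each batch reports its own loss over its own frame count, while tot_loss reports the average over all frames accumulated so far. The sketch below is a minimal, hypothetical tracker that reproduces that bookkeeping; the names FrameWeightedTracker and update are inventions for illustration, and the fact that the tot_loss frame counts stay near 4.26M rather than growing without bound suggests the real tracker also decays or periodically resets its totals, a detail this sketch omits.

class FrameWeightedTracker:
    """Hypothetical frame-weighted running average for loss reporting."""

    def __init__(self):
        self.sums = {}      # loss name -> accumulated (loss * frames)
        self.frames = 0.0   # total frames accumulated so far

    def update(self, losses, num_frames):
        # Each batch contributes loss * frames to the numerator, so short
        # and long batches are weighted by how much audio they contain.
        for name, value in losses.items():
            self.sums[name] = self.sums.get(name, 0.0) + value * num_frames
        self.frames += num_frames
        # The printed tot_loss[...] values are the frame-weighted averages.
        return {name: s / self.frames for name, s in self.sums.items()}


# Example with the batch 21450 numbers from this log (hypothetical usage):
#   tracker = FrameWeightedTracker()
#   tracker.update({"loss": 0.2093, "simple_loss": 0.2852,
#                   "pruned_loss": 0.06675}, num_frames=21940)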
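A second detail worth a sketch is the "Clipping_scale=2.0, grad-norm quartiles ... threshold ... percent-clipped" lines from optim.py: in every such line the threshold equals clipping_scale times the middle quartile (for example, the 08:03:24 entry has median 6.994e+02 and threshold 1.399e+03, and the 08:28:33 entry has median 1.435e+03 and threshold 2.870e+03). The code below is a minimal reconstruction of that reading, not the recipe's actual optimizer: the class name MedianGradClipper, the 100-step history, the index-based quantiles, and the per-step percent-clipped accounting are assumptions; the real optimizer may keep norms per parameter group, interact with the FP16 grad scaler, and report percent-clipped per logging interval.

from collections import deque

import torch


class MedianGradClipper:
    """Clip the global grad norm to clipping_scale * (median of recent norms).

    Hypothetical helper written to mirror the log lines above; the recipe's
    optim.py may organize this differently.
    """

    def __init__(self, clipping_scale: float = 2.0, history: int = 100):
        self.clipping_scale = clipping_scale
        self.norms = deque(maxlen=history)  # recent global gradient norms
        self.num_steps = 0
        self.num_clipped = 0

    def __call__(self, parameters) -> torch.Tensor:
        params = [p for p in parameters if p.grad is not None]
        if not params:
            return torch.tensor(0.0)

        # Global L2 norm over all gradients, one scalar per step.
        total_norm = torch.norm(
            torch.stack([p.grad.detach().norm(2) for p in params]), 2
        )
        self.norms.append(float(total_norm))
        self.num_steps += 1

        # Five-point summary in the order the log prints it:
        # min, 25%, median, 75%, max.
        sorted_norms = sorted(self.norms)
        n = len(sorted_norms)
        quartiles = [
            sorted_norms[int(q * (n - 1))] for q in (0.0, 0.25, 0.5, 0.75, 1.0)
        ]
        threshold = self.clipping_scale * quartiles[2]

        if float(total_norm) > threshold:
            self.num_clipped += 1
            # Scale all gradients down so the global norm equals the threshold.
            scale = threshold / (float(total_norm) + 1e-20)
            for p in params:
                p.grad.detach().mul_(scale)

        percent_clipped = 100.0 * self.num_clipped / self.num_steps
        quartile_str = " ".join(f"{q:.3e}" for q in quartiles)
        print(
            f"Clipping_scale={self.clipping_scale}, grad-norm quartiles "
            f"{quartile_str}, threshold={threshold:.3e}, "
            f"percent-clipped={percent_clipped:.1f}"
        )
        return total_norm


# Typical use inside a training loop (hypothetical):
#   clipper = MedianGradClipper(clipping_scale=2.0)
#   loss.backward()
#   clipper(model.parameters())
#   optimizer.step()

Read this way, percent-clipped=0.0 simply means no step in the current window exceeded twice the median norm, while values such as 43.0 mark bursts of unusually large gradients.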
2023-06-27 09:12:02,878 INFO [train.py:1029] (3/4) Maximum memory allocated so far is 23690MB 2023-06-27 09:12:05,281 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1772712.0, ans=0.125 2023-06-27 09:12:34,861 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1772772.0, ans=0.125 2023-06-27 09:13:21,296 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1772952.0, ans=0.125 2023-06-27 09:13:42,557 INFO [train.py:996] (3/4) Epoch 10, batch 21050, loss[loss=0.2303, simple_loss=0.2816, pruned_loss=0.08947, over 21447.00 frames. ], tot_loss[loss=0.2031, simple_loss=0.2789, pruned_loss=0.06366, over 4245024.92 frames. ], batch size: 441, lr: 2.91e-03, grad_scale: 16.0 2023-06-27 09:13:50,730 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1773012.0, ans=0.1 2023-06-27 09:14:35,804 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=1773132.0, ans=0.1 2023-06-27 09:14:59,142 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.065e+02 6.476e+02 8.191e+02 1.141e+03 2.345e+03, threshold=1.638e+03, percent-clipped=6.0 2023-06-27 09:15:13,776 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=1773252.0, ans=0.0 2023-06-27 09:15:23,616 INFO [train.py:996] (3/4) Epoch 10, batch 21100, loss[loss=0.1995, simple_loss=0.2478, pruned_loss=0.07558, over 20212.00 frames. ], tot_loss[loss=0.2014, simple_loss=0.2757, pruned_loss=0.06354, over 4244007.07 frames. ], batch size: 703, lr: 2.91e-03, grad_scale: 16.0 2023-06-27 09:16:58,546 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1773552.0, ans=0.125 2023-06-27 09:17:08,854 INFO [train.py:996] (3/4) Epoch 10, batch 21150, loss[loss=0.1937, simple_loss=0.2561, pruned_loss=0.06565, over 21776.00 frames. ], tot_loss[loss=0.2, simple_loss=0.2717, pruned_loss=0.06416, over 4241781.41 frames. ], batch size: 352, lr: 2.91e-03, grad_scale: 16.0 2023-06-27 09:17:16,222 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=1773612.0, ans=0.0 2023-06-27 09:17:59,999 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1773732.0, ans=0.0 2023-06-27 09:18:00,656 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=9.57 vs. 
limit=22.5 2023-06-27 09:18:25,061 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=1773792.0, ans=0.0 2023-06-27 09:18:33,090 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.032e+02 6.384e+02 8.623e+02 1.133e+03 2.526e+03, threshold=1.725e+03, percent-clipped=9.0 2023-06-27 09:18:33,607 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=1773852.0, ans=0.0 2023-06-27 09:18:42,381 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=1773852.0, ans=0.0 2023-06-27 09:18:51,966 INFO [train.py:996] (3/4) Epoch 10, batch 21200, loss[loss=0.1996, simple_loss=0.2683, pruned_loss=0.06551, over 21747.00 frames. ], tot_loss[loss=0.1973, simple_loss=0.2683, pruned_loss=0.06317, over 4253961.96 frames. ], batch size: 351, lr: 2.91e-03, grad_scale: 32.0 2023-06-27 09:18:52,601 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1773912.0, ans=0.125 2023-06-27 09:19:01,947 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.73 vs. limit=10.0 2023-06-27 09:19:18,406 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1773972.0, ans=0.125 2023-06-27 09:19:47,362 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.27 vs. limit=15.0 2023-06-27 09:19:49,236 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=9.36 vs. limit=15.0 2023-06-27 09:20:23,069 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=1774152.0, ans=0.035 2023-06-27 09:20:28,704 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=1774152.0, ans=0.125 2023-06-27 09:20:35,155 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1774152.0, ans=0.0 2023-06-27 09:20:44,823 INFO [train.py:996] (3/4) Epoch 10, batch 21250, loss[loss=0.1884, simple_loss=0.2639, pruned_loss=0.05644, over 21827.00 frames. ], tot_loss[loss=0.1967, simple_loss=0.267, pruned_loss=0.0632, over 4254362.09 frames. ], batch size: 118, lr: 2.91e-03, grad_scale: 16.0 2023-06-27 09:21:34,715 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.01 vs. limit=15.0 2023-06-27 09:21:50,480 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1774392.0, ans=0.125 2023-06-27 09:21:57,577 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1774392.0, ans=0.125 2023-06-27 09:22:08,468 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.773e+02 6.524e+02 9.029e+02 1.391e+03 2.253e+03, threshold=1.806e+03, percent-clipped=10.0 2023-06-27 09:22:25,328 INFO [train.py:996] (3/4) Epoch 10, batch 21300, loss[loss=0.2466, simple_loss=0.3126, pruned_loss=0.09029, over 21782.00 frames. 
], tot_loss[loss=0.2005, simple_loss=0.2722, pruned_loss=0.06439, over 4262296.38 frames. ], batch size: 441, lr: 2.91e-03, grad_scale: 16.0 2023-06-27 09:23:32,473 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=5.78 vs. limit=15.0 2023-06-27 09:23:36,949 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1774692.0, ans=0.125 2023-06-27 09:23:40,737 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=1774692.0, ans=0.04949747468305833 2023-06-27 09:23:44,163 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1774692.0, ans=0.0 2023-06-27 09:23:58,775 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.69 vs. limit=10.0 2023-06-27 09:24:13,072 INFO [train.py:996] (3/4) Epoch 10, batch 21350, loss[loss=0.2166, simple_loss=0.305, pruned_loss=0.06409, over 21601.00 frames. ], tot_loss[loss=0.2047, simple_loss=0.2774, pruned_loss=0.06602, over 4273589.08 frames. ], batch size: 441, lr: 2.90e-03, grad_scale: 16.0 2023-06-27 09:24:24,149 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1774812.0, ans=0.125 2023-06-27 09:24:41,548 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1774872.0, ans=0.125 2023-06-27 09:24:55,657 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=1774872.0, ans=0.125 2023-06-27 09:24:59,599 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=16.10 vs. limit=22.5 2023-06-27 09:25:35,024 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1774992.0, ans=0.125 2023-06-27 09:25:39,159 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.868e+02 6.492e+02 8.824e+02 1.457e+03 2.432e+03, threshold=1.765e+03, percent-clipped=7.0 2023-06-27 09:26:01,037 INFO [train.py:996] (3/4) Epoch 10, batch 21400, loss[loss=0.2625, simple_loss=0.3455, pruned_loss=0.08971, over 21808.00 frames. ], tot_loss[loss=0.2072, simple_loss=0.282, pruned_loss=0.06618, over 4274486.40 frames. ], batch size: 118, lr: 2.90e-03, grad_scale: 16.0 2023-06-27 09:26:13,254 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1775112.0, ans=0.125 2023-06-27 09:26:40,306 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=1775172.0, ans=0.0 2023-06-27 09:27:47,916 INFO [train.py:996] (3/4) Epoch 10, batch 21450, loss[loss=0.2093, simple_loss=0.2852, pruned_loss=0.06675, over 21940.00 frames. ], tot_loss[loss=0.2114, simple_loss=0.2866, pruned_loss=0.06812, over 4273904.32 frames. 
], batch size: 333, lr: 2.90e-03, grad_scale: 16.0 2023-06-27 09:28:04,387 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-27 09:29:03,299 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=10.17 vs. limit=15.0 2023-06-27 09:29:11,486 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1775652.0, ans=0.125 2023-06-27 09:29:12,486 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.429e+02 6.380e+02 8.632e+02 1.324e+03 3.087e+03, threshold=1.726e+03, percent-clipped=6.0 2023-06-27 09:29:21,158 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=1775652.0, ans=0.125 2023-06-27 09:29:29,593 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=1775652.0, ans=0.05 2023-06-27 09:29:39,666 INFO [train.py:996] (3/4) Epoch 10, batch 21500, loss[loss=0.2332, simple_loss=0.2803, pruned_loss=0.09302, over 21431.00 frames. ], tot_loss[loss=0.211, simple_loss=0.2842, pruned_loss=0.06887, over 4273996.29 frames. ], batch size: 441, lr: 2.90e-03, grad_scale: 16.0 2023-06-27 09:29:42,131 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1775712.0, ans=0.1 2023-06-27 09:30:37,246 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=1775892.0, ans=0.125 2023-06-27 09:30:52,382 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1775952.0, ans=0.125 2023-06-27 09:31:04,472 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1.whitening_limit, batch_count=1775952.0, ans=10.0 2023-06-27 09:31:16,652 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1775952.0, ans=0.125 2023-06-27 09:31:25,378 INFO [train.py:996] (3/4) Epoch 10, batch 21550, loss[loss=0.1912, simple_loss=0.2541, pruned_loss=0.0641, over 21329.00 frames. ], tot_loss[loss=0.205, simple_loss=0.2781, pruned_loss=0.06597, over 4263668.30 frames. ], batch size: 548, lr: 2.90e-03, grad_scale: 16.0 2023-06-27 09:31:27,515 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=1776012.0, ans=0.2 2023-06-27 09:32:26,937 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1776192.0, ans=0.1 2023-06-27 09:32:27,447 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.84 vs. limit=10.0 2023-06-27 09:32:45,479 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.045e+02 5.690e+02 8.515e+02 1.276e+03 3.905e+03, threshold=1.703e+03, percent-clipped=13.0 2023-06-27 09:33:15,666 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=1776252.0, ans=0.0 2023-06-27 09:33:20,023 INFO [train.py:996] (3/4) Epoch 10, batch 21600, loss[loss=0.2084, simple_loss=0.2944, pruned_loss=0.06117, over 21542.00 frames. 
], tot_loss[loss=0.2035, simple_loss=0.2764, pruned_loss=0.06529, over 4256967.56 frames. ], batch size: 389, lr: 2.90e-03, grad_scale: 32.0 2023-06-27 09:33:26,327 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1776312.0, ans=0.125 2023-06-27 09:33:57,777 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1776432.0, ans=0.0 2023-06-27 09:34:03,347 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.01 vs. limit=15.0 2023-06-27 09:34:14,858 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1776492.0, ans=0.1 2023-06-27 09:35:06,697 INFO [train.py:996] (3/4) Epoch 10, batch 21650, loss[loss=0.1897, simple_loss=0.2548, pruned_loss=0.06228, over 21106.00 frames. ], tot_loss[loss=0.202, simple_loss=0.2776, pruned_loss=0.06317, over 4253973.10 frames. ], batch size: 176, lr: 2.90e-03, grad_scale: 32.0 2023-06-27 09:36:26,712 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.628e+02 5.833e+02 8.995e+02 1.569e+03 2.622e+03, threshold=1.799e+03, percent-clipped=22.0 2023-06-27 09:36:30,706 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=1776852.0, ans=0.0 2023-06-27 09:36:53,196 INFO [train.py:996] (3/4) Epoch 10, batch 21700, loss[loss=0.17, simple_loss=0.2506, pruned_loss=0.04477, over 21542.00 frames. ], tot_loss[loss=0.201, simple_loss=0.2785, pruned_loss=0.06177, over 4263204.70 frames. ], batch size: 195, lr: 2.90e-03, grad_scale: 16.0 2023-06-27 09:37:19,226 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=1776972.0, ans=0.05 2023-06-27 09:37:48,237 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=7.29 vs. limit=15.0 2023-06-27 09:38:05,853 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=1777152.0, ans=0.0 2023-06-27 09:38:38,210 INFO [train.py:996] (3/4) Epoch 10, batch 21750, loss[loss=0.1952, simple_loss=0.261, pruned_loss=0.06468, over 21648.00 frames. ], tot_loss[loss=0.1993, simple_loss=0.2748, pruned_loss=0.06187, over 4268347.43 frames. ], batch size: 282, lr: 2.90e-03, grad_scale: 16.0 2023-06-27 09:39:12,529 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1777332.0, ans=0.125 2023-06-27 09:39:28,700 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.95 vs. limit=15.0 2023-06-27 09:39:58,324 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.215e+02 5.988e+02 7.907e+02 1.038e+03 1.862e+03, threshold=1.581e+03, percent-clipped=2.0 2023-06-27 09:40:16,301 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1777452.0, ans=0.125 2023-06-27 09:40:24,239 INFO [train.py:996] (3/4) Epoch 10, batch 21800, loss[loss=0.1923, simple_loss=0.2672, pruned_loss=0.05864, over 21593.00 frames. ], tot_loss[loss=0.1986, simple_loss=0.2731, pruned_loss=0.06207, over 4255876.52 frames. 
], batch size: 247, lr: 2.90e-03, grad_scale: 16.0 2023-06-27 09:40:31,817 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=1777512.0, ans=0.2 2023-06-27 09:40:50,212 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1777572.0, ans=0.0 2023-06-27 09:40:58,863 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1777632.0, ans=0.1 2023-06-27 09:41:31,283 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1777692.0, ans=0.125 2023-06-27 09:42:02,112 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=1777752.0, ans=0.0 2023-06-27 09:42:10,436 INFO [train.py:996] (3/4) Epoch 10, batch 21850, loss[loss=0.2481, simple_loss=0.3198, pruned_loss=0.0882, over 21617.00 frames. ], tot_loss[loss=0.2027, simple_loss=0.2796, pruned_loss=0.06288, over 4241194.64 frames. ], batch size: 471, lr: 2.90e-03, grad_scale: 16.0 2023-06-27 09:42:23,245 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1777812.0, ans=0.1 2023-06-27 09:42:33,535 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1777872.0, ans=0.1 2023-06-27 09:43:00,528 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=1777992.0, ans=0.0 2023-06-27 09:43:30,244 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.011e+02 6.825e+02 1.374e+03 1.718e+03 3.521e+03, threshold=2.747e+03, percent-clipped=39.0 2023-06-27 09:43:55,432 INFO [train.py:996] (3/4) Epoch 10, batch 21900, loss[loss=0.2411, simple_loss=0.2767, pruned_loss=0.1028, over 21387.00 frames. ], tot_loss[loss=0.2036, simple_loss=0.2792, pruned_loss=0.06398, over 4258532.73 frames. ], batch size: 508, lr: 2.90e-03, grad_scale: 16.0 2023-06-27 09:45:10,471 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-27 09:45:34,384 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.18 vs. limit=10.0 2023-06-27 09:45:40,238 INFO [train.py:996] (3/4) Epoch 10, batch 21950, loss[loss=0.1662, simple_loss=0.2352, pruned_loss=0.04858, over 21234.00 frames. ], tot_loss[loss=0.1996, simple_loss=0.2738, pruned_loss=0.06267, over 4254423.84 frames. ], batch size: 144, lr: 2.90e-03, grad_scale: 16.0 2023-06-27 09:45:42,557 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1778412.0, ans=0.125 2023-06-27 09:46:06,499 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.56 vs. 
limit=22.5 2023-06-27 09:47:06,215 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.644e+02 5.395e+02 6.536e+02 9.589e+02 2.193e+03, threshold=1.307e+03, percent-clipped=0.0 2023-06-27 09:47:25,653 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=1778712.0, ans=0.2 2023-06-27 09:47:26,910 INFO [train.py:996] (3/4) Epoch 10, batch 22000, loss[loss=0.2547, simple_loss=0.3672, pruned_loss=0.07103, over 19871.00 frames. ], tot_loss[loss=0.1949, simple_loss=0.2694, pruned_loss=0.06019, over 4254141.93 frames. ], batch size: 702, lr: 2.90e-03, grad_scale: 32.0 2023-06-27 09:47:35,799 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.56 vs. limit=15.0 2023-06-27 09:49:18,753 INFO [train.py:996] (3/4) Epoch 10, batch 22050, loss[loss=0.2236, simple_loss=0.3139, pruned_loss=0.06665, over 21757.00 frames. ], tot_loss[loss=0.198, simple_loss=0.2733, pruned_loss=0.06137, over 4229860.39 frames. ], batch size: 282, lr: 2.90e-03, grad_scale: 32.0 2023-06-27 09:50:40,517 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.98 vs. limit=22.5 2023-06-27 09:50:42,479 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.00 vs. limit=22.5 2023-06-27 09:50:51,716 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.283e+02 7.142e+02 9.080e+02 1.741e+03 3.538e+03, threshold=1.816e+03, percent-clipped=36.0 2023-06-27 09:51:05,034 INFO [train.py:996] (3/4) Epoch 10, batch 22100, loss[loss=0.2435, simple_loss=0.3139, pruned_loss=0.08648, over 21273.00 frames. ], tot_loss[loss=0.2071, simple_loss=0.2832, pruned_loss=0.06547, over 4236186.35 frames. ], batch size: 548, lr: 2.90e-03, grad_scale: 16.0 2023-06-27 09:51:22,828 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=1779372.0, ans=0.2 2023-06-27 09:52:51,246 INFO [train.py:996] (3/4) Epoch 10, batch 22150, loss[loss=0.2113, simple_loss=0.2779, pruned_loss=0.07232, over 21533.00 frames. ], tot_loss[loss=0.2093, simple_loss=0.2859, pruned_loss=0.06635, over 4249212.11 frames. ], batch size: 548, lr: 2.90e-03, grad_scale: 16.0 2023-06-27 09:53:11,930 INFO [scaling.py:962] (3/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=5.31 vs. limit=8.0 2023-06-27 09:53:53,071 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1779792.0, ans=0.125 2023-06-27 09:54:25,382 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.248e+02 5.661e+02 7.475e+02 1.093e+03 2.487e+03, threshold=1.495e+03, percent-clipped=9.0 2023-06-27 09:54:39,171 INFO [train.py:996] (3/4) Epoch 10, batch 22200, loss[loss=0.2206, simple_loss=0.3099, pruned_loss=0.06567, over 21854.00 frames. ], tot_loss[loss=0.2113, simple_loss=0.2882, pruned_loss=0.06719, over 4257217.85 frames. 
], batch size: 124, lr: 2.90e-03, grad_scale: 16.0 2023-06-27 09:54:39,934 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=1779912.0, ans=0.125 2023-06-27 09:55:15,047 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1779972.0, ans=0.125 2023-06-27 09:55:56,550 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=1780092.0, ans=0.125 2023-06-27 09:56:27,594 INFO [train.py:996] (3/4) Epoch 10, batch 22250, loss[loss=0.2458, simple_loss=0.3287, pruned_loss=0.08146, over 21262.00 frames. ], tot_loss[loss=0.2148, simple_loss=0.2935, pruned_loss=0.06801, over 4264897.18 frames. ], batch size: 143, lr: 2.90e-03, grad_scale: 16.0 2023-06-27 09:56:34,935 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=1780212.0, ans=0.125 2023-06-27 09:56:57,847 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1780272.0, ans=0.125 2023-06-27 09:57:37,082 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1780392.0, ans=0.1 2023-06-27 09:57:58,531 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.450e+02 6.092e+02 1.077e+03 1.479e+03 2.486e+03, threshold=2.154e+03, percent-clipped=24.0 2023-06-27 09:58:12,273 INFO [train.py:996] (3/4) Epoch 10, batch 22300, loss[loss=0.2131, simple_loss=0.2801, pruned_loss=0.07305, over 21929.00 frames. ], tot_loss[loss=0.2182, simple_loss=0.2951, pruned_loss=0.07061, over 4272182.40 frames. ], batch size: 283, lr: 2.90e-03, grad_scale: 16.0 2023-06-27 09:58:19,102 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1780512.0, ans=0.125 2023-06-27 09:59:58,419 INFO [train.py:996] (3/4) Epoch 10, batch 22350, loss[loss=0.2251, simple_loss=0.2948, pruned_loss=0.07768, over 21839.00 frames. ], tot_loss[loss=0.2185, simple_loss=0.2941, pruned_loss=0.07141, over 4274541.89 frames. ], batch size: 107, lr: 2.90e-03, grad_scale: 16.0 2023-06-27 10:00:00,650 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1780812.0, ans=0.125 2023-06-27 10:00:31,486 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=1780872.0, ans=0.125 2023-06-27 10:00:44,220 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=1780932.0, ans=0.0 2023-06-27 10:01:08,566 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=1780932.0, ans=0.125 2023-06-27 10:01:32,055 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.877e+02 5.424e+02 7.099e+02 9.642e+02 1.783e+03, threshold=1.420e+03, percent-clipped=0.0 2023-06-27 10:01:45,087 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.whiten.whitening_limit, batch_count=1781112.0, ans=15.0 2023-06-27 10:01:45,477 INFO [train.py:996] (3/4) Epoch 10, batch 22400, loss[loss=0.2012, simple_loss=0.2731, pruned_loss=0.06468, over 21624.00 frames. ], tot_loss[loss=0.2145, simple_loss=0.2905, pruned_loss=0.06924, over 4268514.45 frames. 
], batch size: 332, lr: 2.90e-03, grad_scale: 32.0 2023-06-27 10:02:32,635 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.10 vs. limit=22.5 2023-06-27 10:02:39,760 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1781232.0, ans=0.125 2023-06-27 10:02:49,402 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1781232.0, ans=0.125 2023-06-27 10:03:13,609 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.29 vs. limit=12.0 2023-06-27 10:03:17,351 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.11 vs. limit=22.5 2023-06-27 10:03:31,231 INFO [train.py:996] (3/4) Epoch 10, batch 22450, loss[loss=0.1928, simple_loss=0.2589, pruned_loss=0.06338, over 21181.00 frames. ], tot_loss[loss=0.2108, simple_loss=0.2847, pruned_loss=0.06844, over 4268702.02 frames. ], batch size: 176, lr: 2.90e-03, grad_scale: 16.0 2023-06-27 10:04:51,268 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1781592.0, ans=0.125 2023-06-27 10:04:59,745 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1781652.0, ans=0.0 2023-06-27 10:05:05,134 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1781652.0, ans=0.125 2023-06-27 10:05:07,716 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.016e+02 6.073e+02 8.479e+02 1.179e+03 3.261e+03, threshold=1.696e+03, percent-clipped=18.0 2023-06-27 10:05:18,389 INFO [train.py:996] (3/4) Epoch 10, batch 22500, loss[loss=0.2235, simple_loss=0.3268, pruned_loss=0.06014, over 21646.00 frames. ], tot_loss[loss=0.2079, simple_loss=0.2808, pruned_loss=0.06747, over 4272920.05 frames. ], batch size: 298, lr: 2.90e-03, grad_scale: 8.0 2023-06-27 10:05:20,658 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1781712.0, ans=0.125 2023-06-27 10:05:48,394 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1781772.0, ans=0.125 2023-06-27 10:06:47,630 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=1781952.0, ans=0.2 2023-06-27 10:06:47,760 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=1781952.0, ans=0.125 2023-06-27 10:07:06,833 INFO [train.py:996] (3/4) Epoch 10, batch 22550, loss[loss=0.2074, simple_loss=0.2802, pruned_loss=0.06731, over 21852.00 frames. ], tot_loss[loss=0.2096, simple_loss=0.2842, pruned_loss=0.0675, over 4281649.18 frames. 
], batch size: 282, lr: 2.90e-03, grad_scale: 8.0 2023-06-27 10:07:15,012 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=1782012.0, ans=0.0 2023-06-27 10:07:15,014 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1782012.0, ans=0.1 2023-06-27 10:07:25,479 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=1782072.0, ans=0.0 2023-06-27 10:07:58,481 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=1782132.0, ans=0.0 2023-06-27 10:08:17,881 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=1782192.0, ans=0.0 2023-06-27 10:08:23,649 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1782192.0, ans=0.125 2023-06-27 10:08:41,500 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.419e+02 6.178e+02 1.242e+03 1.950e+03 4.739e+03, threshold=2.485e+03, percent-clipped=29.0 2023-06-27 10:08:51,916 INFO [train.py:996] (3/4) Epoch 10, batch 22600, loss[loss=0.306, simple_loss=0.3794, pruned_loss=0.1163, over 21477.00 frames. ], tot_loss[loss=0.2125, simple_loss=0.2885, pruned_loss=0.06822, over 4282579.63 frames. ], batch size: 507, lr: 2.90e-03, grad_scale: 8.0 2023-06-27 10:09:29,296 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1782372.0, ans=0.0 2023-06-27 10:10:00,356 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1782492.0, ans=0.1 2023-06-27 10:10:02,394 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.31 vs. limit=6.0 2023-06-27 10:10:13,963 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=1782492.0, ans=10.0 2023-06-27 10:10:32,439 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1782552.0, ans=0.125 2023-06-27 10:10:38,544 INFO [train.py:996] (3/4) Epoch 10, batch 22650, loss[loss=0.2124, simple_loss=0.2799, pruned_loss=0.07248, over 21817.00 frames. ], tot_loss[loss=0.2102, simple_loss=0.2845, pruned_loss=0.06793, over 4284609.36 frames. ], batch size: 98, lr: 2.90e-03, grad_scale: 8.0 2023-06-27 10:10:55,053 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1782612.0, ans=0.125 2023-06-27 10:11:51,328 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=1782792.0, ans=0.0 2023-06-27 10:12:16,225 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.186e+02 6.070e+02 1.000e+03 1.313e+03 3.118e+03, threshold=2.001e+03, percent-clipped=3.0 2023-06-27 10:12:26,385 INFO [train.py:996] (3/4) Epoch 10, batch 22700, loss[loss=0.1953, simple_loss=0.2675, pruned_loss=0.0616, over 21783.00 frames. ], tot_loss[loss=0.2061, simple_loss=0.2787, pruned_loss=0.06677, over 4277725.76 frames. 
], batch size: 112, lr: 2.90e-03, grad_scale: 8.0 2023-06-27 10:12:47,321 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1782912.0, ans=0.0 2023-06-27 10:13:14,786 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1783032.0, ans=0.1 2023-06-27 10:13:34,097 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=1783092.0, ans=0.2 2023-06-27 10:13:44,293 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1783092.0, ans=0.125 2023-06-27 10:13:46,668 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=6.05 vs. limit=22.5 2023-06-27 10:14:12,799 INFO [train.py:996] (3/4) Epoch 10, batch 22750, loss[loss=0.1807, simple_loss=0.2572, pruned_loss=0.05214, over 21972.00 frames. ], tot_loss[loss=0.2084, simple_loss=0.2804, pruned_loss=0.06818, over 4270500.77 frames. ], batch size: 103, lr: 2.90e-03, grad_scale: 8.0 2023-06-27 10:14:43,116 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1783212.0, ans=0.125 2023-06-27 10:14:49,532 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=1783272.0, ans=0.125 2023-06-27 10:15:48,221 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.913e+02 6.203e+02 1.029e+03 1.531e+03 3.011e+03, threshold=2.057e+03, percent-clipped=6.0 2023-06-27 10:16:04,095 INFO [train.py:996] (3/4) Epoch 10, batch 22800, loss[loss=0.2123, simple_loss=0.2823, pruned_loss=0.07115, over 21307.00 frames. ], tot_loss[loss=0.2127, simple_loss=0.2844, pruned_loss=0.0705, over 4277851.71 frames. ], batch size: 176, lr: 2.90e-03, grad_scale: 16.0 2023-06-27 10:16:58,750 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1783632.0, ans=0.1 2023-06-27 10:17:44,762 INFO [train.py:996] (3/4) Epoch 10, batch 22850, loss[loss=0.1853, simple_loss=0.2545, pruned_loss=0.05799, over 21813.00 frames. ], tot_loss[loss=0.2113, simple_loss=0.2831, pruned_loss=0.06973, over 4273953.84 frames. ], batch size: 118, lr: 2.90e-03, grad_scale: 16.0 2023-06-27 10:19:23,323 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.547e+02 9.815e+02 1.470e+03 2.221e+03 4.175e+03, threshold=2.939e+03, percent-clipped=31.0 2023-06-27 10:19:44,584 INFO [train.py:996] (3/4) Epoch 10, batch 22900, loss[loss=0.1928, simple_loss=0.3005, pruned_loss=0.04255, over 21815.00 frames. ], tot_loss[loss=0.211, simple_loss=0.2835, pruned_loss=0.06927, over 4274180.03 frames. ], batch size: 282, lr: 2.90e-03, grad_scale: 16.0 2023-06-27 10:20:32,588 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=1784232.0, ans=0.0 2023-06-27 10:21:17,438 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1784352.0, ans=0.125 2023-06-27 10:21:31,354 INFO [train.py:996] (3/4) Epoch 10, batch 22950, loss[loss=0.2297, simple_loss=0.3216, pruned_loss=0.06889, over 19918.00 frames. ], tot_loss[loss=0.217, simple_loss=0.2966, pruned_loss=0.0687, over 4274129.26 frames. 
], batch size: 703, lr: 2.90e-03, grad_scale: 16.0 2023-06-27 10:21:36,848 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1784412.0, ans=0.125 2023-06-27 10:21:59,215 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=1784472.0, ans=0.0 2023-06-27 10:22:11,876 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1784532.0, ans=0.0 2023-06-27 10:22:15,151 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1784532.0, ans=0.0 2023-06-27 10:22:31,396 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1784592.0, ans=0.0 2023-06-27 10:22:48,639 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=1784652.0, ans=0.1 2023-06-27 10:22:51,598 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.863e+02 5.875e+02 8.793e+02 1.271e+03 3.173e+03, threshold=1.759e+03, percent-clipped=4.0 2023-06-27 10:23:05,430 INFO [train.py:996] (3/4) Epoch 10, batch 23000, loss[loss=0.2509, simple_loss=0.317, pruned_loss=0.0924, over 21606.00 frames. ], tot_loss[loss=0.214, simple_loss=0.2957, pruned_loss=0.06619, over 4275872.69 frames. ], batch size: 471, lr: 2.90e-03, grad_scale: 16.0 2023-06-27 10:23:43,508 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=1784832.0, ans=0.2 2023-06-27 10:24:02,790 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=1784892.0, ans=0.0 2023-06-27 10:24:46,696 INFO [train.py:996] (3/4) Epoch 10, batch 23050, loss[loss=0.2592, simple_loss=0.3297, pruned_loss=0.09438, over 21805.00 frames. ], tot_loss[loss=0.2165, simple_loss=0.2967, pruned_loss=0.0681, over 4280882.47 frames. ], batch size: 441, lr: 2.90e-03, grad_scale: 16.0 2023-06-27 10:24:57,028 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1785012.0, ans=0.0 2023-06-27 10:24:57,563 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=7.86 vs. limit=15.0 2023-06-27 10:25:34,336 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=1785132.0, ans=0.2 2023-06-27 10:25:56,975 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.42 vs. limit=22.5 2023-06-27 10:26:07,772 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=1785192.0, ans=0.125 2023-06-27 10:26:08,387 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=10.20 vs. limit=15.0 2023-06-27 10:26:16,698 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.37 vs. 
limit=22.5 2023-06-27 10:26:17,101 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.088e+02 5.488e+02 7.273e+02 1.121e+03 2.826e+03, threshold=1.455e+03, percent-clipped=6.0 2023-06-27 10:26:17,817 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=1785252.0, ans=0.125 2023-06-27 10:26:26,890 INFO [train.py:996] (3/4) Epoch 10, batch 23100, loss[loss=0.1841, simple_loss=0.2505, pruned_loss=0.05881, over 21437.00 frames. ], tot_loss[loss=0.2145, simple_loss=0.2919, pruned_loss=0.06858, over 4275049.74 frames. ], batch size: 131, lr: 2.90e-03, grad_scale: 16.0 2023-06-27 10:26:27,482 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1785312.0, ans=0.125 2023-06-27 10:26:44,854 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=1785372.0, ans=0.2 2023-06-27 10:26:48,054 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=1785372.0, ans=0.125 2023-06-27 10:26:54,425 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=1785372.0, ans=0.125 2023-06-27 10:27:35,745 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.24 vs. limit=15.0 2023-06-27 10:27:56,105 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=1785552.0, ans=0.125 2023-06-27 10:28:02,058 INFO [train.py:996] (3/4) Epoch 10, batch 23150, loss[loss=0.1766, simple_loss=0.2314, pruned_loss=0.06089, over 20712.00 frames. ], tot_loss[loss=0.2107, simple_loss=0.2859, pruned_loss=0.06773, over 4274298.95 frames. ], batch size: 609, lr: 2.90e-03, grad_scale: 16.0 2023-06-27 10:28:21,523 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1785672.0, ans=0.0 2023-06-27 10:29:12,270 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1785792.0, ans=0.1 2023-06-27 10:29:25,878 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.152e+02 5.952e+02 7.532e+02 1.121e+03 2.900e+03, threshold=1.506e+03, percent-clipped=14.0 2023-06-27 10:29:28,664 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.00 vs. limit=15.0 2023-06-27 10:29:35,447 INFO [train.py:996] (3/4) Epoch 10, batch 23200, loss[loss=0.2187, simple_loss=0.2866, pruned_loss=0.07546, over 21537.00 frames. ], tot_loss[loss=0.2115, simple_loss=0.2863, pruned_loss=0.06842, over 4277530.66 frames. ], batch size: 194, lr: 2.90e-03, grad_scale: 32.0 2023-06-27 10:30:03,640 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1785972.0, ans=0.1 2023-06-27 10:30:13,574 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.98 vs. 
limit=12.0 2023-06-27 10:31:09,859 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=1786212.0, ans=0.0 2023-06-27 10:31:10,863 INFO [train.py:996] (3/4) Epoch 10, batch 23250, loss[loss=0.2374, simple_loss=0.3014, pruned_loss=0.08671, over 21742.00 frames. ], tot_loss[loss=0.2125, simple_loss=0.286, pruned_loss=0.06948, over 4289621.49 frames. ], batch size: 389, lr: 2.90e-03, grad_scale: 16.0 2023-06-27 10:31:29,547 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1786272.0, ans=0.125 2023-06-27 10:31:39,760 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=1786272.0, ans=0.125 2023-06-27 10:31:52,515 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=1786332.0, ans=0.125 2023-06-27 10:32:44,613 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.481e+02 7.308e+02 1.025e+03 1.554e+03 3.146e+03, threshold=2.050e+03, percent-clipped=26.0 2023-06-27 10:32:48,720 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1786452.0, ans=0.125 2023-06-27 10:32:52,912 INFO [train.py:996] (3/4) Epoch 10, batch 23300, loss[loss=0.2447, simple_loss=0.3572, pruned_loss=0.06609, over 21202.00 frames. ], tot_loss[loss=0.2171, simple_loss=0.293, pruned_loss=0.07056, over 4285903.85 frames. ], batch size: 548, lr: 2.90e-03, grad_scale: 16.0 2023-06-27 10:33:13,883 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.33 vs. limit=15.0 2023-06-27 10:33:47,241 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1786632.0, ans=0.1 2023-06-27 10:34:33,874 INFO [train.py:996] (3/4) Epoch 10, batch 23350, loss[loss=0.1731, simple_loss=0.2609, pruned_loss=0.04266, over 21697.00 frames. ], tot_loss[loss=0.2186, simple_loss=0.2969, pruned_loss=0.07017, over 4289344.83 frames. ], batch size: 351, lr: 2.90e-03, grad_scale: 16.0 2023-06-27 10:34:42,593 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=1786812.0, ans=0.5 2023-06-27 10:35:46,174 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1786992.0, ans=0.125 2023-06-27 10:35:47,765 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=1786992.0, ans=0.025 2023-06-27 10:35:54,570 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1786992.0, ans=0.125 2023-06-27 10:36:02,743 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1787052.0, ans=0.125 2023-06-27 10:36:05,514 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.446e+02 7.072e+02 1.049e+03 1.355e+03 2.858e+03, threshold=2.098e+03, percent-clipped=8.0 2023-06-27 10:36:13,602 INFO [train.py:996] (3/4) Epoch 10, batch 23400, loss[loss=0.2073, simple_loss=0.2793, pruned_loss=0.0677, over 21835.00 frames. ], tot_loss[loss=0.2125, simple_loss=0.2907, pruned_loss=0.06711, over 4280796.02 frames. 
], batch size: 298, lr: 2.89e-03, grad_scale: 16.0 2023-06-27 10:36:24,306 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1787112.0, ans=0.0 2023-06-27 10:37:33,152 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.whiten.whitening_limit, batch_count=1787292.0, ans=12.0 2023-06-27 10:37:53,799 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1787412.0, ans=0.125 2023-06-27 10:37:54,862 INFO [train.py:996] (3/4) Epoch 10, batch 23450, loss[loss=0.2197, simple_loss=0.2954, pruned_loss=0.07202, over 21744.00 frames. ], tot_loss[loss=0.2142, simple_loss=0.2911, pruned_loss=0.06865, over 4288538.64 frames. ], batch size: 332, lr: 2.89e-03, grad_scale: 16.0 2023-06-27 10:38:16,359 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=5.03 vs. limit=12.0 2023-06-27 10:39:08,493 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1787592.0, ans=0.0 2023-06-27 10:39:25,541 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.955e+02 6.626e+02 1.004e+03 1.261e+03 2.377e+03, threshold=2.009e+03, percent-clipped=2.0 2023-06-27 10:39:26,076 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer_ff2.min_abs, batch_count=1787652.0, ans=0.1 2023-06-27 10:39:38,047 INFO [train.py:996] (3/4) Epoch 10, batch 23500, loss[loss=0.2418, simple_loss=0.2962, pruned_loss=0.09372, over 21817.00 frames. ], tot_loss[loss=0.2156, simple_loss=0.2912, pruned_loss=0.07007, over 4291063.77 frames. ], batch size: 508, lr: 2.89e-03, grad_scale: 16.0 2023-06-27 10:41:03,244 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=1787952.0, ans=0.025 2023-06-27 10:41:08,266 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=1787952.0, ans=0.0 2023-06-27 10:41:17,176 INFO [train.py:996] (3/4) Epoch 10, batch 23550, loss[loss=0.1988, simple_loss=0.2577, pruned_loss=0.06993, over 21604.00 frames. ], tot_loss[loss=0.2138, simple_loss=0.2884, pruned_loss=0.06965, over 4280396.69 frames. ], batch size: 414, lr: 2.89e-03, grad_scale: 8.0 2023-06-27 10:41:48,771 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1788072.0, ans=0.1 2023-06-27 10:41:55,854 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=9.37 vs. limit=15.0 2023-06-27 10:42:08,907 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=1788132.0, ans=0.125 2023-06-27 10:42:18,659 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1788132.0, ans=0.1 2023-06-27 10:42:33,622 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.88 vs. 
limit=22.5 2023-06-27 10:42:36,410 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=1788252.0, ans=0.04949747468305833 2023-06-27 10:42:47,396 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.468e+02 6.896e+02 9.643e+02 1.434e+03 2.789e+03, threshold=1.929e+03, percent-clipped=7.0 2023-06-27 10:42:48,214 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1788252.0, ans=0.0 2023-06-27 10:42:58,647 INFO [train.py:996] (3/4) Epoch 10, batch 23600, loss[loss=0.2248, simple_loss=0.3052, pruned_loss=0.07218, over 21723.00 frames. ], tot_loss[loss=0.2146, simple_loss=0.2891, pruned_loss=0.07008, over 4278923.63 frames. ], batch size: 298, lr: 2.89e-03, grad_scale: 16.0 2023-06-27 10:43:09,324 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.59 vs. limit=22.5 2023-06-27 10:43:15,262 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1788312.0, ans=0.1 2023-06-27 10:43:25,981 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=13.56 vs. limit=22.5 2023-06-27 10:44:09,314 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=8.37 vs. limit=15.0 2023-06-27 10:44:13,934 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1788492.0, ans=0.1 2023-06-27 10:44:15,539 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1788492.0, ans=0.125 2023-06-27 10:44:47,507 INFO [train.py:996] (3/4) Epoch 10, batch 23650, loss[loss=0.2433, simple_loss=0.3199, pruned_loss=0.08329, over 21248.00 frames. ], tot_loss[loss=0.2136, simple_loss=0.2894, pruned_loss=0.06891, over 4275230.32 frames. ], batch size: 143, lr: 2.89e-03, grad_scale: 16.0 2023-06-27 10:44:48,169 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1788612.0, ans=0.125 2023-06-27 10:45:17,704 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=1788672.0, ans=0.05 2023-06-27 10:45:19,352 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1788672.0, ans=0.0 2023-06-27 10:45:35,031 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.69 vs. 
limit=15.0 2023-06-27 10:45:49,540 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1788792.0, ans=0.125 2023-06-27 10:46:05,223 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=1788792.0, ans=0.0 2023-06-27 10:46:27,252 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.326e+02 5.698e+02 8.154e+02 1.096e+03 2.339e+03, threshold=1.631e+03, percent-clipped=3.0 2023-06-27 10:46:38,087 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.15 vs. limit=15.0 2023-06-27 10:46:38,533 INFO [train.py:996] (3/4) Epoch 10, batch 23700, loss[loss=0.231, simple_loss=0.3123, pruned_loss=0.07481, over 21575.00 frames. ], tot_loss[loss=0.215, simple_loss=0.2928, pruned_loss=0.06858, over 4279510.84 frames. ], batch size: 414, lr: 2.89e-03, grad_scale: 16.0 2023-06-27 10:46:57,223 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=1788972.0, ans=0.2 2023-06-27 10:47:47,460 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1789092.0, ans=0.1 2023-06-27 10:47:47,521 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1789092.0, ans=0.1 2023-06-27 10:48:13,646 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=1789152.0, ans=0.125 2023-06-27 10:48:19,722 INFO [train.py:996] (3/4) Epoch 10, batch 23750, loss[loss=0.2081, simple_loss=0.287, pruned_loss=0.06456, over 21764.00 frames. ], tot_loss[loss=0.2175, simple_loss=0.2953, pruned_loss=0.06982, over 4281847.50 frames. ], batch size: 298, lr: 2.89e-03, grad_scale: 16.0 2023-06-27 10:48:20,438 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1789212.0, ans=0.125 2023-06-27 10:49:55,487 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.670e+02 6.081e+02 7.830e+02 1.141e+03 2.559e+03, threshold=1.566e+03, percent-clipped=8.0 2023-06-27 10:49:57,776 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1789452.0, ans=0.1 2023-06-27 10:50:02,087 INFO [train.py:996] (3/4) Epoch 10, batch 23800, loss[loss=0.2044, simple_loss=0.2894, pruned_loss=0.05971, over 21387.00 frames. ], tot_loss[loss=0.2132, simple_loss=0.293, pruned_loss=0.06668, over 4278542.67 frames. ], batch size: 194, lr: 2.89e-03, grad_scale: 16.0 2023-06-27 10:50:09,637 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=1789512.0, ans=0.125 2023-06-27 10:50:19,739 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1789572.0, ans=0.1 2023-06-27 10:51:19,477 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.49 vs. limit=15.0 2023-06-27 10:51:34,884 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=9.14 vs. 
limit=15.0 2023-06-27 10:51:45,228 INFO [train.py:996] (3/4) Epoch 10, batch 23850, loss[loss=0.247, simple_loss=0.3325, pruned_loss=0.0807, over 21493.00 frames. ], tot_loss[loss=0.2214, simple_loss=0.3037, pruned_loss=0.06955, over 4279449.38 frames. ], batch size: 131, lr: 2.89e-03, grad_scale: 16.0 2023-06-27 10:51:59,510 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.81 vs. limit=15.0 2023-06-27 10:52:17,564 INFO [scaling.py:962] (3/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=4.05 vs. limit=5.0 2023-06-27 10:52:48,839 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.49 vs. limit=22.5 2023-06-27 10:52:56,411 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-27 10:53:18,317 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.153e+02 6.860e+02 1.142e+03 1.790e+03 3.579e+03, threshold=2.285e+03, percent-clipped=29.0 2023-06-27 10:53:24,732 INFO [train.py:996] (3/4) Epoch 10, batch 23900, loss[loss=0.202, simple_loss=0.2535, pruned_loss=0.07528, over 20189.00 frames. ], tot_loss[loss=0.2255, simple_loss=0.3082, pruned_loss=0.07143, over 4279974.70 frames. ], batch size: 703, lr: 2.89e-03, grad_scale: 16.0 2023-06-27 10:53:30,158 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=1790112.0, ans=0.125 2023-06-27 10:53:42,001 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten.whitening_limit, batch_count=1790112.0, ans=15.0 2023-06-27 10:54:29,687 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1790232.0, ans=0.125 2023-06-27 10:55:05,881 INFO [train.py:996] (3/4) Epoch 10, batch 23950, loss[loss=0.18, simple_loss=0.2451, pruned_loss=0.05748, over 21414.00 frames. ], tot_loss[loss=0.2214, simple_loss=0.3012, pruned_loss=0.07081, over 4265986.44 frames. ], batch size: 211, lr: 2.89e-03, grad_scale: 16.0 2023-06-27 10:55:47,524 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=10.18 vs. limit=15.0 2023-06-27 10:55:57,958 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1790532.0, ans=0.1 2023-06-27 10:56:40,581 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.159e+02 7.309e+02 9.584e+02 1.406e+03 2.703e+03, threshold=1.917e+03, percent-clipped=3.0 2023-06-27 10:56:41,982 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=7.42 vs. limit=15.0 2023-06-27 10:56:47,120 INFO [train.py:996] (3/4) Epoch 10, batch 24000, loss[loss=0.2538, simple_loss=0.3353, pruned_loss=0.0862, over 21277.00 frames. ], tot_loss[loss=0.2254, simple_loss=0.3031, pruned_loss=0.07385, over 4260538.86 frames. ], batch size: 143, lr: 2.89e-03, grad_scale: 32.0 2023-06-27 10:56:47,121 INFO [train.py:1019] (3/4) Computing validation loss 2023-06-27 10:57:07,139 INFO [train.py:1028] (3/4) Epoch 10, validation: loss=0.2621, simple_loss=0.3549, pruned_loss=0.08461, over 1796401.00 frames. 
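(Aside, not part of the log output: the recurring fields above — the per-batch "tot_loss[loss=…]" summaries, the optim.py grad-norm quartiles with their threshold and percent-clipped figures, and the periodic validation-loss lines — can be pulled out of a saved log with a short script. The sketch below is a hypothetical helper written for this log format, not icefall code; the log path shown in the usage comment is an assumption.)

import re
import sys

# Per-batch summaries look like:
#   "Epoch 10, batch 21400, loss[...], tot_loss[loss=0.2072, ...]"
BATCH_RE = re.compile(r"Epoch (\d+), batch (\d+),.*?tot_loss\[loss=([\d.]+)", re.DOTALL)
# Gradient-clipping reports look like:
#   "... threshold=1.765e+03, percent-clipped=7.0"
CLIP_RE = re.compile(r"threshold=([\d.e+]+), percent-clipped=([\d.]+)")

def summarize(text: str) -> None:
    # Collect (epoch, batch, tot_loss) triples and (threshold, percent_clipped) pairs.
    losses = [(int(e), int(b), float(l)) for e, b, l in BATCH_RE.findall(text)]
    clips = [(float(t), float(p)) for t, p in CLIP_RE.findall(text)]
    if losses:
        epoch, batch, loss = losses[-1]
        print(f"last batch summary: epoch {epoch}, batch {batch}, tot_loss {loss}")
    if clips:
        mean_pct = sum(p for _, p in clips) / len(clips)
        print(f"{len(clips)} clipping reports, mean percent-clipped {mean_pct:.1f}%")

if __name__ == "__main__":
    # Usage (path is hypothetical): python summarize_log.py zipformer/exp_L_small/log-train.txt
    with open(sys.argv[1]) as f:
        summarize(f.read())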
2023-06-27 10:57:07,141 INFO [train.py:1029] (3/4) Maximum memory allocated so far is 23690MB 2023-06-27 10:57:52,169 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.41 vs. limit=22.5 2023-06-27 10:58:14,956 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1790892.0, ans=0.125 2023-06-27 10:58:45,459 INFO [train.py:996] (3/4) Epoch 10, batch 24050, loss[loss=0.2079, simple_loss=0.302, pruned_loss=0.05691, over 21632.00 frames. ], tot_loss[loss=0.2253, simple_loss=0.3034, pruned_loss=0.07358, over 4264961.35 frames. ], batch size: 414, lr: 2.89e-03, grad_scale: 16.0 2023-06-27 10:59:05,755 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=1791072.0, ans=0.2 2023-06-27 10:59:35,495 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1791132.0, ans=0.125 2023-06-27 11:00:10,367 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.71 vs. limit=15.0 2023-06-27 11:00:21,995 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.514e+02 5.749e+02 8.023e+02 1.325e+03 2.806e+03, threshold=1.605e+03, percent-clipped=11.0 2023-06-27 11:00:32,043 INFO [train.py:996] (3/4) Epoch 10, batch 24100, loss[loss=0.2309, simple_loss=0.3148, pruned_loss=0.07347, over 21720.00 frames. ], tot_loss[loss=0.2251, simple_loss=0.304, pruned_loss=0.0731, over 4271842.71 frames. ], batch size: 332, lr: 2.89e-03, grad_scale: 16.0 2023-06-27 11:00:38,670 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=1791312.0, ans=0.035 2023-06-27 11:00:49,647 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.53 vs. limit=22.5 2023-06-27 11:01:08,594 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=1791432.0, ans=0.07 2023-06-27 11:01:21,622 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=1791492.0, ans=10.0 2023-06-27 11:01:57,679 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.57 vs. limit=15.0 2023-06-27 11:02:13,404 INFO [train.py:996] (3/4) Epoch 10, batch 24150, loss[loss=0.2729, simple_loss=0.3339, pruned_loss=0.1059, over 21752.00 frames. ], tot_loss[loss=0.2265, simple_loss=0.3042, pruned_loss=0.07443, over 4281882.66 frames. 
], batch size: 441, lr: 2.89e-03, grad_scale: 16.0 2023-06-27 11:02:17,191 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1791612.0, ans=0.125 2023-06-27 11:02:20,781 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=1791612.0, ans=0.07 2023-06-27 11:02:25,710 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=1791612.0, ans=0.125 2023-06-27 11:02:40,170 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1791672.0, ans=0.1 2023-06-27 11:03:44,525 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1791852.0, ans=0.1 2023-06-27 11:03:45,388 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.503e+02 6.384e+02 9.147e+02 1.297e+03 2.622e+03, threshold=1.829e+03, percent-clipped=12.0 2023-06-27 11:03:50,485 INFO [train.py:996] (3/4) Epoch 10, batch 24200, loss[loss=0.288, simple_loss=0.3642, pruned_loss=0.1059, over 21594.00 frames. ], tot_loss[loss=0.2305, simple_loss=0.3079, pruned_loss=0.0765, over 4282332.70 frames. ], batch size: 441, lr: 2.89e-03, grad_scale: 16.0 2023-06-27 11:04:30,751 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1792032.0, ans=0.1 2023-06-27 11:05:03,471 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1792092.0, ans=0.125 2023-06-27 11:05:23,724 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1792152.0, ans=0.125 2023-06-27 11:05:33,111 INFO [train.py:996] (3/4) Epoch 10, batch 24250, loss[loss=0.2281, simple_loss=0.3254, pruned_loss=0.06541, over 21649.00 frames. ], tot_loss[loss=0.2229, simple_loss=0.3043, pruned_loss=0.07078, over 4285887.06 frames. ], batch size: 441, lr: 2.89e-03, grad_scale: 16.0 2023-06-27 11:05:38,488 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1792212.0, ans=0.125 2023-06-27 11:05:50,226 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=1792272.0, ans=0.04949747468305833 2023-06-27 11:06:08,421 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=1792272.0, ans=0.125 2023-06-27 11:06:26,300 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=9.52 vs. limit=15.0 2023-06-27 11:06:32,996 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.55 vs. limit=15.0 2023-06-27 11:06:58,564 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=1792452.0, ans=0.0 2023-06-27 11:07:09,242 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.373e+02 5.833e+02 9.026e+02 1.321e+03 2.992e+03, threshold=1.805e+03, percent-clipped=10.0 2023-06-27 11:07:14,037 INFO [train.py:996] (3/4) Epoch 10, batch 24300, loss[loss=0.186, simple_loss=0.2713, pruned_loss=0.05032, over 21806.00 frames. 
], tot_loss[loss=0.2159, simple_loss=0.2995, pruned_loss=0.06618, over 4285176.12 frames. ], batch size: 351, lr: 2.89e-03, grad_scale: 16.0 2023-06-27 11:07:34,277 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=1792572.0, ans=0.0 2023-06-27 11:08:00,713 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=4.38 vs. limit=12.0 2023-06-27 11:08:17,701 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1792692.0, ans=0.125 2023-06-27 11:08:48,897 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=1792752.0, ans=0.2 2023-06-27 11:08:54,893 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.66 vs. limit=15.0 2023-06-27 11:08:55,408 INFO [train.py:996] (3/4) Epoch 10, batch 24350, loss[loss=0.2183, simple_loss=0.2972, pruned_loss=0.0697, over 21814.00 frames. ], tot_loss[loss=0.2137, simple_loss=0.297, pruned_loss=0.06523, over 4281558.87 frames. ], batch size: 351, lr: 2.89e-03, grad_scale: 16.0 2023-06-27 11:08:56,693 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=2.78 vs. limit=12.0 2023-06-27 11:09:02,892 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1792812.0, ans=0.125 2023-06-27 11:09:41,101 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=1792932.0, ans=0.0 2023-06-27 11:10:00,095 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.97 vs. limit=22.5 2023-06-27 11:10:17,081 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1793052.0, ans=0.125 2023-06-27 11:10:19,009 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.95 vs. limit=12.0 2023-06-27 11:10:27,622 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.762e+02 6.342e+02 9.950e+02 1.336e+03 3.105e+03, threshold=1.990e+03, percent-clipped=13.0 2023-06-27 11:10:32,373 INFO [train.py:996] (3/4) Epoch 10, batch 24400, loss[loss=0.2092, simple_loss=0.2957, pruned_loss=0.06138, over 21597.00 frames. ], tot_loss[loss=0.2164, simple_loss=0.2981, pruned_loss=0.06734, over 4282716.42 frames. ], batch size: 263, lr: 2.89e-03, grad_scale: 32.0 2023-06-27 11:10:36,957 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.62 vs. limit=22.5 2023-06-27 11:11:00,561 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1793172.0, ans=0.125 2023-06-27 11:11:53,673 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.04 vs. 
limit=15.0 2023-06-27 11:12:01,373 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=1793352.0, ans=0.125 2023-06-27 11:12:08,351 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1793352.0, ans=0.125 2023-06-27 11:12:14,394 INFO [train.py:996] (3/4) Epoch 10, batch 24450, loss[loss=0.1981, simple_loss=0.2776, pruned_loss=0.0593, over 21154.00 frames. ], tot_loss[loss=0.217, simple_loss=0.2975, pruned_loss=0.06827, over 4280004.97 frames. ], batch size: 143, lr: 2.89e-03, grad_scale: 16.0 2023-06-27 11:12:24,869 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1793412.0, ans=0.125 2023-06-27 11:12:34,229 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer_ff2.min_abs, batch_count=1793412.0, ans=0.1 2023-06-27 11:12:39,536 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.49 vs. limit=22.5 2023-06-27 11:12:44,096 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1793472.0, ans=0.125 2023-06-27 11:13:38,633 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=1793652.0, ans=0.09899494936611666 2023-06-27 11:13:46,557 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=1793652.0, ans=0.125 2023-06-27 11:13:50,877 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.345e+02 6.636e+02 9.241e+02 1.233e+03 3.193e+03, threshold=1.848e+03, percent-clipped=3.0 2023-06-27 11:13:54,210 INFO [train.py:996] (3/4) Epoch 10, batch 24500, loss[loss=0.2062, simple_loss=0.2605, pruned_loss=0.07593, over 20291.00 frames. ], tot_loss[loss=0.2179, simple_loss=0.2986, pruned_loss=0.06863, over 4281712.38 frames. ], batch size: 703, lr: 2.89e-03, grad_scale: 16.0 2023-06-27 11:14:46,302 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=1793832.0, ans=0.04949747468305833 2023-06-27 11:15:16,216 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1793892.0, ans=0.125 2023-06-27 11:15:40,099 INFO [train.py:996] (3/4) Epoch 10, batch 24550, loss[loss=0.2549, simple_loss=0.336, pruned_loss=0.08689, over 21120.00 frames. ], tot_loss[loss=0.2201, simple_loss=0.2999, pruned_loss=0.07016, over 4283046.23 frames. ], batch size: 143, lr: 2.89e-03, grad_scale: 16.0 2023-06-27 11:15:59,860 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.43 vs. 
limit=15.0 2023-06-27 11:16:26,830 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=1794132.0, ans=0.2 2023-06-27 11:16:35,163 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1794132.0, ans=0.0 2023-06-27 11:17:16,637 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.256e+02 6.446e+02 9.214e+02 1.322e+03 3.260e+03, threshold=1.843e+03, percent-clipped=13.0 2023-06-27 11:17:19,822 INFO [train.py:996] (3/4) Epoch 10, batch 24600, loss[loss=0.2236, simple_loss=0.2992, pruned_loss=0.07407, over 20714.00 frames. ], tot_loss[loss=0.2182, simple_loss=0.2962, pruned_loss=0.07016, over 4279071.86 frames. ], batch size: 607, lr: 2.89e-03, grad_scale: 16.0 2023-06-27 11:18:05,642 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=1794432.0, ans=0.0 2023-06-27 11:18:10,618 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1794432.0, ans=0.0 2023-06-27 11:19:01,855 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=1794552.0, ans=0.05 2023-06-27 11:19:05,997 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.04 vs. limit=12.0 2023-06-27 11:19:06,384 INFO [train.py:996] (3/4) Epoch 10, batch 24650, loss[loss=0.2039, simple_loss=0.2764, pruned_loss=0.06572, over 20802.00 frames. ], tot_loss[loss=0.2134, simple_loss=0.2888, pruned_loss=0.06899, over 4273605.18 frames. ], batch size: 609, lr: 2.89e-03, grad_scale: 16.0 2023-06-27 11:19:08,683 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=1794612.0, ans=0.0 2023-06-27 11:19:24,888 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=1794612.0, ans=0.0 2023-06-27 11:20:38,902 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.116e+02 6.411e+02 8.563e+02 1.154e+03 3.780e+03, threshold=1.713e+03, percent-clipped=12.0 2023-06-27 11:20:41,299 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1794912.0, ans=0.0 2023-06-27 11:20:42,279 INFO [train.py:996] (3/4) Epoch 10, batch 24700, loss[loss=0.2146, simple_loss=0.2854, pruned_loss=0.07193, over 21486.00 frames. ], tot_loss[loss=0.2111, simple_loss=0.2865, pruned_loss=0.06781, over 4263896.97 frames. ], batch size: 441, lr: 2.89e-03, grad_scale: 16.0 2023-06-27 11:20:54,666 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.95 vs. 
limit=15.0 2023-06-27 11:21:21,278 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1795032.0, ans=0.125 2023-06-27 11:21:32,193 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1795032.0, ans=0.1 2023-06-27 11:21:40,190 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=1795092.0, ans=0.125 2023-06-27 11:21:54,792 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1795152.0, ans=0.1 2023-06-27 11:22:16,083 INFO [train.py:996] (3/4) Epoch 10, batch 24750, loss[loss=0.1874, simple_loss=0.2672, pruned_loss=0.05374, over 20734.00 frames. ], tot_loss[loss=0.2062, simple_loss=0.2808, pruned_loss=0.06576, over 4269008.25 frames. ], batch size: 607, lr: 2.89e-03, grad_scale: 16.0 2023-06-27 11:22:21,435 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=1795212.0, ans=0.125 2023-06-27 11:22:31,025 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=1795272.0, ans=0.2 2023-06-27 11:22:33,186 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=8.79 vs. limit=15.0 2023-06-27 11:22:42,631 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1795272.0, ans=0.0 2023-06-27 11:23:24,389 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1795452.0, ans=0.1 2023-06-27 11:23:38,627 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.117e+02 6.853e+02 9.586e+02 1.478e+03 3.032e+03, threshold=1.917e+03, percent-clipped=13.0 2023-06-27 11:23:46,722 INFO [train.py:996] (3/4) Epoch 10, batch 24800, loss[loss=0.2295, simple_loss=0.2899, pruned_loss=0.08456, over 21355.00 frames. ], tot_loss[loss=0.2035, simple_loss=0.2759, pruned_loss=0.06552, over 4274326.75 frames. ], batch size: 159, lr: 2.89e-03, grad_scale: 32.0 2023-06-27 11:24:28,593 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1795632.0, ans=0.1 2023-06-27 11:24:33,599 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=1795632.0, ans=0.125 2023-06-27 11:25:07,068 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1795752.0, ans=0.125 2023-06-27 11:25:29,258 INFO [train.py:996] (3/4) Epoch 10, batch 24850, loss[loss=0.1879, simple_loss=0.2512, pruned_loss=0.06231, over 21310.00 frames. ], tot_loss[loss=0.2057, simple_loss=0.2767, pruned_loss=0.06739, over 4275177.14 frames. ], batch size: 143, lr: 2.89e-03, grad_scale: 16.0 2023-06-27 11:26:36,667 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.61 vs. 
limit=6.0 2023-06-27 11:26:59,061 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.781e+02 6.964e+02 9.660e+02 1.513e+03 3.423e+03, threshold=1.932e+03, percent-clipped=14.0 2023-06-27 11:27:00,591 INFO [train.py:996] (3/4) Epoch 10, batch 24900, loss[loss=0.2308, simple_loss=0.3064, pruned_loss=0.07762, over 21844.00 frames. ], tot_loss[loss=0.2076, simple_loss=0.2791, pruned_loss=0.06805, over 4277521.04 frames. ], batch size: 282, lr: 2.89e-03, grad_scale: 16.0 2023-06-27 11:27:08,187 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.09 vs. limit=6.0 2023-06-27 11:27:18,906 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1796112.0, ans=0.1 2023-06-27 11:27:18,946 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=1796112.0, ans=0.0 2023-06-27 11:27:20,408 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=1796172.0, ans=0.0 2023-06-27 11:27:24,315 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.75 vs. limit=15.0 2023-06-27 11:27:47,139 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=1796232.0, ans=0.0 2023-06-27 11:27:55,337 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=1796292.0, ans=0.05 2023-06-27 11:28:39,600 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.61 vs. limit=15.0 2023-06-27 11:28:41,463 INFO [train.py:996] (3/4) Epoch 10, batch 24950, loss[loss=0.2243, simple_loss=0.2943, pruned_loss=0.07713, over 21586.00 frames. ], tot_loss[loss=0.2156, simple_loss=0.2871, pruned_loss=0.07206, over 4276641.57 frames. ], batch size: 263, lr: 2.89e-03, grad_scale: 16.0 2023-06-27 11:28:57,605 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten.whitening_limit, batch_count=1796412.0, ans=15.0 2023-06-27 11:30:13,755 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1796652.0, ans=0.1 2023-06-27 11:30:19,510 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.654e+02 6.926e+02 9.542e+02 1.348e+03 3.788e+03, threshold=1.908e+03, percent-clipped=7.0 2023-06-27 11:30:20,989 INFO [train.py:996] (3/4) Epoch 10, batch 25000, loss[loss=0.2086, simple_loss=0.2825, pruned_loss=0.0674, over 21908.00 frames. ], tot_loss[loss=0.2203, simple_loss=0.2939, pruned_loss=0.0733, over 4278850.10 frames. ], batch size: 118, lr: 2.89e-03, grad_scale: 16.0 2023-06-27 11:30:37,801 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1796712.0, ans=0.125 2023-06-27 11:32:11,878 INFO [train.py:996] (3/4) Epoch 10, batch 25050, loss[loss=0.2322, simple_loss=0.3395, pruned_loss=0.06247, over 19965.00 frames. ], tot_loss[loss=0.2155, simple_loss=0.2882, pruned_loss=0.07141, over 4280280.36 frames. 
], batch size: 702, lr: 2.89e-03, grad_scale: 16.0 2023-06-27 11:32:41,464 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=1797072.0, ans=0.125 2023-06-27 11:32:42,807 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=1797132.0, ans=0.0 2023-06-27 11:33:31,453 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1797252.0, ans=0.125 2023-06-27 11:33:38,005 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1797252.0, ans=0.125 2023-06-27 11:33:50,161 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.906e+02 5.494e+02 7.889e+02 1.087e+03 2.340e+03, threshold=1.578e+03, percent-clipped=4.0 2023-06-27 11:33:51,533 INFO [train.py:996] (3/4) Epoch 10, batch 25100, loss[loss=0.2136, simple_loss=0.3096, pruned_loss=0.05882, over 21852.00 frames. ], tot_loss[loss=0.2128, simple_loss=0.2836, pruned_loss=0.07102, over 4285193.69 frames. ], batch size: 371, lr: 2.89e-03, grad_scale: 16.0 2023-06-27 11:33:56,782 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=1797312.0, ans=0.2 2023-06-27 11:34:01,840 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1797312.0, ans=0.1 2023-06-27 11:34:01,850 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=1797312.0, ans=0.2 2023-06-27 11:34:29,286 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=1797432.0, ans=0.5 2023-06-27 11:34:56,422 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=1797492.0, ans=0.04949747468305833 2023-06-27 11:35:26,551 INFO [train.py:996] (3/4) Epoch 10, batch 25150, loss[loss=0.1932, simple_loss=0.2846, pruned_loss=0.05085, over 21817.00 frames. ], tot_loss[loss=0.2133, simple_loss=0.2875, pruned_loss=0.06955, over 4281096.75 frames. ], batch size: 332, lr: 2.89e-03, grad_scale: 16.0 2023-06-27 11:35:41,482 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1797612.0, ans=0.125 2023-06-27 11:36:15,581 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.60 vs. limit=12.0 2023-06-27 11:36:59,841 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=13.36 vs. limit=22.5 2023-06-27 11:37:03,922 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=1797852.0, ans=0.0 2023-06-27 11:37:04,859 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.938e+02 6.755e+02 1.265e+03 1.654e+03 3.292e+03, threshold=2.530e+03, percent-clipped=31.0 2023-06-27 11:37:06,427 INFO [train.py:996] (3/4) Epoch 10, batch 25200, loss[loss=0.2076, simple_loss=0.2608, pruned_loss=0.07721, over 20077.00 frames. ], tot_loss[loss=0.211, simple_loss=0.2864, pruned_loss=0.06786, over 4282319.00 frames. 
], batch size: 702, lr: 2.89e-03, grad_scale: 32.0 2023-06-27 11:37:57,677 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1798032.0, ans=0.125 2023-06-27 11:38:27,852 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1798092.0, ans=0.125 2023-06-27 11:38:46,162 INFO [train.py:996] (3/4) Epoch 10, batch 25250, loss[loss=0.1878, simple_loss=0.258, pruned_loss=0.05874, over 21191.00 frames. ], tot_loss[loss=0.2085, simple_loss=0.2839, pruned_loss=0.06661, over 4271840.14 frames. ], batch size: 548, lr: 2.89e-03, grad_scale: 16.0 2023-06-27 11:38:59,665 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1798212.0, ans=0.0 2023-06-27 11:39:37,619 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=1798332.0, ans=0.0 2023-06-27 11:39:55,132 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1798392.0, ans=0.1 2023-06-27 11:40:06,836 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.68 vs. limit=15.0 2023-06-27 11:40:32,591 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.222e+02 7.212e+02 1.026e+03 1.530e+03 2.488e+03, threshold=2.053e+03, percent-clipped=0.0 2023-06-27 11:40:32,621 INFO [train.py:996] (3/4) Epoch 10, batch 25300, loss[loss=0.1995, simple_loss=0.2829, pruned_loss=0.05808, over 21729.00 frames. ], tot_loss[loss=0.2075, simple_loss=0.2826, pruned_loss=0.0662, over 4262250.73 frames. ], batch size: 298, lr: 2.89e-03, grad_scale: 16.0 2023-06-27 11:41:51,589 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=1798752.0, ans=0.125 2023-06-27 11:41:58,849 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=5.11 vs. limit=15.0 2023-06-27 11:42:13,847 INFO [train.py:996] (3/4) Epoch 10, batch 25350, loss[loss=0.178, simple_loss=0.2652, pruned_loss=0.04543, over 21761.00 frames. ], tot_loss[loss=0.2081, simple_loss=0.2847, pruned_loss=0.06573, over 4252253.49 frames. ], batch size: 282, lr: 2.89e-03, grad_scale: 16.0 2023-06-27 11:42:17,459 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=1798812.0, ans=0.2 2023-06-27 11:42:17,941 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.82 vs. limit=22.5 2023-06-27 11:42:48,846 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.86 vs. limit=10.0 2023-06-27 11:42:52,268 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=9.32 vs. 
limit=15.0 2023-06-27 11:43:26,444 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=1798992.0, ans=0.125 2023-06-27 11:43:42,619 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1799052.0, ans=0.0 2023-06-27 11:43:53,122 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.945e+02 6.520e+02 8.858e+02 1.308e+03 2.699e+03, threshold=1.772e+03, percent-clipped=4.0 2023-06-27 11:43:53,152 INFO [train.py:996] (3/4) Epoch 10, batch 25400, loss[loss=0.1805, simple_loss=0.2493, pruned_loss=0.05588, over 21344.00 frames. ], tot_loss[loss=0.2059, simple_loss=0.2815, pruned_loss=0.06514, over 4255346.05 frames. ], batch size: 144, lr: 2.89e-03, grad_scale: 16.0 2023-06-27 11:44:01,086 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.20 vs. limit=22.5 2023-06-27 11:44:27,206 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.47 vs. limit=22.5 2023-06-27 11:44:31,499 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1799232.0, ans=0.125 2023-06-27 11:44:36,046 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=1799232.0, ans=0.2 2023-06-27 11:44:37,584 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1799232.0, ans=0.0 2023-06-27 11:44:39,885 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=13.42 vs. limit=15.0 2023-06-27 11:45:34,081 INFO [train.py:996] (3/4) Epoch 10, batch 25450, loss[loss=0.2096, simple_loss=0.2757, pruned_loss=0.07178, over 21579.00 frames. ], tot_loss[loss=0.2073, simple_loss=0.2822, pruned_loss=0.06616, over 4258473.75 frames. ], batch size: 263, lr: 2.88e-03, grad_scale: 16.0 2023-06-27 11:46:25,043 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=4.84 vs. limit=12.0 2023-06-27 11:46:28,330 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=5.15 vs. limit=12.0 2023-06-27 11:47:03,786 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1799652.0, ans=0.125 2023-06-27 11:47:16,317 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.168e+02 6.039e+02 8.121e+02 1.135e+03 2.521e+03, threshold=1.624e+03, percent-clipped=2.0 2023-06-27 11:47:16,348 INFO [train.py:996] (3/4) Epoch 10, batch 25500, loss[loss=0.2133, simple_loss=0.3058, pruned_loss=0.0604, over 21620.00 frames. ], tot_loss[loss=0.2049, simple_loss=0.283, pruned_loss=0.06338, over 4261409.80 frames. 
], batch size: 441, lr: 2.88e-03, grad_scale: 16.0 2023-06-27 11:47:29,880 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=1799712.0, ans=0.0 2023-06-27 11:48:24,862 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1799892.0, ans=0.125 2023-06-27 11:48:55,965 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1799952.0, ans=0.125 2023-06-27 11:48:58,445 INFO [train.py:996] (3/4) Epoch 10, batch 25550, loss[loss=0.2502, simple_loss=0.3507, pruned_loss=0.07489, over 21573.00 frames. ], tot_loss[loss=0.2069, simple_loss=0.2884, pruned_loss=0.06269, over 4259476.80 frames. ], batch size: 508, lr: 2.88e-03, grad_scale: 16.0 2023-06-27 11:49:21,632 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1800072.0, ans=0.1 2023-06-27 11:49:34,972 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=1800072.0, ans=0.2 2023-06-27 11:50:03,931 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=1800192.0, ans=0.2 2023-06-27 11:50:15,563 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1800192.0, ans=0.125 2023-06-27 11:50:26,674 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=1800252.0, ans=0.95 2023-06-27 11:50:39,237 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.324e+02 5.995e+02 1.017e+03 1.623e+03 5.096e+03, threshold=2.035e+03, percent-clipped=24.0 2023-06-27 11:50:39,267 INFO [train.py:996] (3/4) Epoch 10, batch 25600, loss[loss=0.2761, simple_loss=0.3544, pruned_loss=0.09889, over 21485.00 frames. ], tot_loss[loss=0.2095, simple_loss=0.2923, pruned_loss=0.06334, over 4266366.02 frames. ], batch size: 131, lr: 2.88e-03, grad_scale: 32.0 2023-06-27 11:51:55,552 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1800492.0, ans=0.0 2023-06-27 11:52:19,295 INFO [train.py:996] (3/4) Epoch 10, batch 25650, loss[loss=0.2241, simple_loss=0.2857, pruned_loss=0.08121, over 21057.00 frames. ], tot_loss[loss=0.2121, simple_loss=0.2927, pruned_loss=0.06572, over 4261624.68 frames. ], batch size: 143, lr: 2.88e-03, grad_scale: 16.0 2023-06-27 11:52:20,560 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.23 vs. limit=10.0 2023-06-27 11:53:48,081 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=1800852.0, ans=0.0 2023-06-27 11:53:49,543 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=1800852.0, ans=0.2 2023-06-27 11:54:00,584 INFO [train.py:996] (3/4) Epoch 10, batch 25700, loss[loss=0.215, simple_loss=0.2677, pruned_loss=0.08119, over 21376.00 frames. ], tot_loss[loss=0.2122, simple_loss=0.2907, pruned_loss=0.06688, over 4258787.63 frames. 
], batch size: 473, lr: 2.88e-03, grad_scale: 16.0 2023-06-27 11:54:05,954 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1800912.0, ans=0.0 2023-06-27 11:54:06,863 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.076e+02 8.398e+02 1.386e+03 2.056e+03 4.305e+03, threshold=2.773e+03, percent-clipped=25.0 2023-06-27 11:55:46,716 INFO [train.py:996] (3/4) Epoch 10, batch 25750, loss[loss=0.2635, simple_loss=0.3274, pruned_loss=0.0998, over 21460.00 frames. ], tot_loss[loss=0.2154, simple_loss=0.2933, pruned_loss=0.06879, over 4263065.01 frames. ], batch size: 471, lr: 2.88e-03, grad_scale: 16.0 2023-06-27 11:56:00,417 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1801212.0, ans=0.1 2023-06-27 11:56:15,145 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=1801272.0, ans=0.2 2023-06-27 11:56:18,698 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-27 11:56:22,028 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1801272.0, ans=0.0 2023-06-27 11:57:23,460 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=1801452.0, ans=0.125 2023-06-27 11:57:39,563 INFO [train.py:996] (3/4) Epoch 10, batch 25800, loss[loss=0.2677, simple_loss=0.3428, pruned_loss=0.09627, over 21775.00 frames. ], tot_loss[loss=0.225, simple_loss=0.3039, pruned_loss=0.07304, over 4262016.03 frames. ], batch size: 441, lr: 2.88e-03, grad_scale: 16.0 2023-06-27 11:57:41,426 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.492e+02 7.009e+02 1.091e+03 1.518e+03 3.688e+03, threshold=2.182e+03, percent-clipped=4.0 2023-06-27 11:58:58,778 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-27 11:59:02,469 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.10 vs. limit=6.0 2023-06-27 11:59:05,406 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=1801752.0, ans=0.2 2023-06-27 11:59:15,126 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.87 vs. limit=6.0 2023-06-27 11:59:22,795 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=1801812.0, ans=0.125 2023-06-27 11:59:23,890 INFO [train.py:996] (3/4) Epoch 10, batch 25850, loss[loss=0.2124, simple_loss=0.294, pruned_loss=0.0654, over 21872.00 frames. ], tot_loss[loss=0.2251, simple_loss=0.3055, pruned_loss=0.07232, over 4264879.55 frames. 
], batch size: 371, lr: 2.88e-03, grad_scale: 16.0 2023-06-27 11:59:34,122 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1801812.0, ans=0.125 2023-06-27 11:59:59,290 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=1801872.0, ans=0.2 2023-06-27 12:00:30,095 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=8.49 vs. limit=10.0 2023-06-27 12:00:34,547 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1801992.0, ans=0.0 2023-06-27 12:01:11,683 INFO [train.py:996] (3/4) Epoch 10, batch 25900, loss[loss=0.2206, simple_loss=0.3128, pruned_loss=0.06416, over 21514.00 frames. ], tot_loss[loss=0.2264, simple_loss=0.3071, pruned_loss=0.07287, over 4273160.70 frames. ], batch size: 230, lr: 2.88e-03, grad_scale: 16.0 2023-06-27 12:01:13,362 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.315e+02 6.367e+02 8.577e+02 1.335e+03 4.211e+03, threshold=1.715e+03, percent-clipped=7.0 2023-06-27 12:02:17,796 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=1802292.0, ans=0.2 2023-06-27 12:02:53,655 INFO [train.py:996] (3/4) Epoch 10, batch 25950, loss[loss=0.2248, simple_loss=0.3132, pruned_loss=0.06819, over 21745.00 frames. ], tot_loss[loss=0.2319, simple_loss=0.313, pruned_loss=0.07539, over 4275796.43 frames. ], batch size: 332, lr: 2.88e-03, grad_scale: 16.0 2023-06-27 12:02:55,893 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1802412.0, ans=0.0 2023-06-27 12:02:57,409 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-27 12:03:05,340 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=1802412.0, ans=0.0 2023-06-27 12:03:16,836 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=1802472.0, ans=0.07 2023-06-27 12:03:27,010 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1802472.0, ans=0.1 2023-06-27 12:03:49,274 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1802532.0, ans=0.0 2023-06-27 12:04:06,130 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1802592.0, ans=0.1 2023-06-27 12:04:35,382 INFO [train.py:996] (3/4) Epoch 10, batch 26000, loss[loss=0.2476, simple_loss=0.3397, pruned_loss=0.07778, over 21439.00 frames. ], tot_loss[loss=0.2307, simple_loss=0.313, pruned_loss=0.07416, over 4272618.52 frames. ], batch size: 131, lr: 2.88e-03, grad_scale: 32.0 2023-06-27 12:04:37,160 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.008e+02 6.185e+02 7.875e+02 1.125e+03 3.104e+03, threshold=1.575e+03, percent-clipped=8.0 2023-06-27 12:05:07,391 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.78 vs. 
limit=10.0 2023-06-27 12:05:08,725 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1802772.0, ans=0.125 2023-06-27 12:05:32,835 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1802832.0, ans=0.125 2023-06-27 12:05:45,201 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.55 vs. limit=12.0 2023-06-27 12:06:07,508 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.25 vs. limit=22.5 2023-06-27 12:06:16,102 INFO [train.py:996] (3/4) Epoch 10, batch 26050, loss[loss=0.2557, simple_loss=0.3724, pruned_loss=0.06953, over 19752.00 frames. ], tot_loss[loss=0.2315, simple_loss=0.3129, pruned_loss=0.07504, over 4273431.03 frames. ], batch size: 702, lr: 2.88e-03, grad_scale: 16.0 2023-06-27 12:06:23,063 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=1803012.0, ans=0.125 2023-06-27 12:06:25,163 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.43 vs. limit=15.0 2023-06-27 12:06:45,428 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=1803072.0, ans=0.125 2023-06-27 12:06:59,593 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1803132.0, ans=0.125 2023-06-27 12:07:39,837 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-27 12:07:50,606 INFO [train.py:996] (3/4) Epoch 10, batch 26100, loss[loss=0.2352, simple_loss=0.2936, pruned_loss=0.0884, over 21787.00 frames. ], tot_loss[loss=0.2293, simple_loss=0.3081, pruned_loss=0.0752, over 4275890.36 frames. ], batch size: 441, lr: 2.88e-03, grad_scale: 16.0 2023-06-27 12:07:53,722 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.444e+02 6.064e+02 8.418e+02 1.151e+03 2.910e+03, threshold=1.684e+03, percent-clipped=10.0 2023-06-27 12:07:56,248 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.15 vs. limit=15.0 2023-06-27 12:08:18,791 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=12.05 vs. limit=22.5 2023-06-27 12:08:19,505 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1803372.0, ans=0.1 2023-06-27 12:08:21,530 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1803372.0, ans=0.1 2023-06-27 12:08:24,664 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1803372.0, ans=0.125 2023-06-27 12:08:43,912 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1803432.0, ans=0.0 2023-06-27 12:09:21,859 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.98 vs. 
limit=22.5 2023-06-27 12:09:30,845 INFO [train.py:996] (3/4) Epoch 10, batch 26150, loss[loss=0.2206, simple_loss=0.2986, pruned_loss=0.07128, over 21879.00 frames. ], tot_loss[loss=0.2276, simple_loss=0.3045, pruned_loss=0.07538, over 4282539.17 frames. ], batch size: 371, lr: 2.88e-03, grad_scale: 16.0 2023-06-27 12:10:10,380 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1803672.0, ans=0.125 2023-06-27 12:10:57,138 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=11.79 vs. limit=22.5 2023-06-27 12:11:14,478 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1803852.0, ans=0.125 2023-06-27 12:11:16,869 INFO [train.py:996] (3/4) Epoch 10, batch 26200, loss[loss=0.2417, simple_loss=0.3498, pruned_loss=0.06678, over 21622.00 frames. ], tot_loss[loss=0.227, simple_loss=0.3063, pruned_loss=0.07385, over 4288734.76 frames. ], batch size: 389, lr: 2.88e-03, grad_scale: 16.0 2023-06-27 12:11:20,490 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.988e+02 7.097e+02 1.092e+03 1.637e+03 2.606e+03, threshold=2.184e+03, percent-clipped=21.0 2023-06-27 12:11:46,778 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1803972.0, ans=0.0 2023-06-27 12:11:58,258 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1804032.0, ans=0.0 2023-06-27 12:12:04,516 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1804032.0, ans=0.0 2023-06-27 12:12:35,458 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1804152.0, ans=0.125 2023-06-27 12:12:56,970 INFO [train.py:996] (3/4) Epoch 10, batch 26250, loss[loss=0.2938, simple_loss=0.3501, pruned_loss=0.1188, over 21758.00 frames. ], tot_loss[loss=0.2263, simple_loss=0.3085, pruned_loss=0.0721, over 4288412.73 frames. ], batch size: 508, lr: 2.88e-03, grad_scale: 16.0 2023-06-27 12:13:00,876 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=1804212.0, ans=0.0 2023-06-27 12:13:20,196 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.58 vs. limit=15.0 2023-06-27 12:13:32,869 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1804332.0, ans=0.125 2023-06-27 12:14:07,927 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=1804392.0, ans=0.1 2023-06-27 12:14:26,991 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1804452.0, ans=0.125 2023-06-27 12:14:36,303 INFO [train.py:996] (3/4) Epoch 10, batch 26300, loss[loss=0.2187, simple_loss=0.2974, pruned_loss=0.06999, over 21919.00 frames. ], tot_loss[loss=0.2258, simple_loss=0.3057, pruned_loss=0.07298, over 4296615.58 frames. 
], batch size: 107, lr: 2.88e-03, grad_scale: 16.0 2023-06-27 12:14:39,661 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.217e+02 5.994e+02 7.746e+02 1.132e+03 2.553e+03, threshold=1.549e+03, percent-clipped=2.0 2023-06-27 12:15:43,821 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-27 12:16:01,594 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=1804752.0, ans=0.035 2023-06-27 12:16:03,137 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1804752.0, ans=0.0 2023-06-27 12:16:16,764 INFO [train.py:996] (3/4) Epoch 10, batch 26350, loss[loss=0.2261, simple_loss=0.3006, pruned_loss=0.07579, over 20740.00 frames. ], tot_loss[loss=0.2253, simple_loss=0.304, pruned_loss=0.07334, over 4297277.68 frames. ], batch size: 607, lr: 2.88e-03, grad_scale: 16.0 2023-06-27 12:16:33,446 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1804872.0, ans=0.125 2023-06-27 12:16:50,067 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.11 vs. limit=12.0 2023-06-27 12:17:18,783 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1804992.0, ans=0.0 2023-06-27 12:17:23,668 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1804992.0, ans=0.125 2023-06-27 12:17:52,008 INFO [train.py:996] (3/4) Epoch 10, batch 26400, loss[loss=0.2252, simple_loss=0.2776, pruned_loss=0.08639, over 21478.00 frames. ], tot_loss[loss=0.2231, simple_loss=0.299, pruned_loss=0.07358, over 4292773.26 frames. ], batch size: 441, lr: 2.88e-03, grad_scale: 32.0 2023-06-27 12:17:55,528 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.612e+02 7.254e+02 1.118e+03 1.690e+03 3.507e+03, threshold=2.236e+03, percent-clipped=29.0 2023-06-27 12:18:11,734 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=1805172.0, ans=0.0 2023-06-27 12:18:28,617 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=1805172.0, ans=0.125 2023-06-27 12:18:41,235 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=12.50 vs. limit=15.0 2023-06-27 12:18:42,665 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1805232.0, ans=0.125 2023-06-27 12:19:33,272 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=1805352.0, ans=0.0 2023-06-27 12:19:36,210 INFO [train.py:996] (3/4) Epoch 10, batch 26450, loss[loss=0.2639, simple_loss=0.3808, pruned_loss=0.07347, over 21136.00 frames. ], tot_loss[loss=0.2217, simple_loss=0.298, pruned_loss=0.07266, over 4281653.19 frames. 
], batch size: 549, lr: 2.88e-03, grad_scale: 16.0 2023-06-27 12:20:02,046 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=1805472.0, ans=0.2 2023-06-27 12:20:33,404 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.07 vs. limit=6.0 2023-06-27 12:20:37,787 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1805532.0, ans=0.1 2023-06-27 12:21:19,538 INFO [train.py:996] (3/4) Epoch 10, batch 26500, loss[loss=0.2297, simple_loss=0.3134, pruned_loss=0.07298, over 21679.00 frames. ], tot_loss[loss=0.2229, simple_loss=0.3018, pruned_loss=0.07197, over 4283084.01 frames. ], batch size: 389, lr: 2.88e-03, grad_scale: 16.0 2023-06-27 12:21:28,833 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.787e+02 8.473e+02 1.317e+03 2.228e+03 4.940e+03, threshold=2.635e+03, percent-clipped=24.0 2023-06-27 12:21:36,939 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-27 12:22:08,252 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=1805832.0, ans=0.125 2023-06-27 12:22:30,134 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1805892.0, ans=0.0 2023-06-27 12:22:38,398 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1805892.0, ans=0.125 2023-06-27 12:22:53,324 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1805952.0, ans=0.125 2023-06-27 12:23:05,308 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1805952.0, ans=0.1 2023-06-27 12:23:07,854 INFO [train.py:996] (3/4) Epoch 10, batch 26550, loss[loss=0.1785, simple_loss=0.2635, pruned_loss=0.04673, over 21660.00 frames. ], tot_loss[loss=0.2188, simple_loss=0.2983, pruned_loss=0.06965, over 4269014.65 frames. ], batch size: 247, lr: 2.88e-03, grad_scale: 16.0 2023-06-27 12:24:00,668 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=1806132.0, ans=0.5 2023-06-27 12:24:00,775 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=1806132.0, ans=0.125 2023-06-27 12:24:28,502 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=1806192.0, ans=10.0 2023-06-27 12:24:41,672 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1806252.0, ans=0.1 2023-06-27 12:24:53,637 INFO [train.py:996] (3/4) Epoch 10, batch 26600, loss[loss=0.2061, simple_loss=0.2849, pruned_loss=0.06363, over 21645.00 frames. ], tot_loss[loss=0.2164, simple_loss=0.2974, pruned_loss=0.06769, over 4262993.91 frames. 
], batch size: 332, lr: 2.88e-03, grad_scale: 16.0 2023-06-27 12:25:02,995 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.900e+02 9.676e+02 1.340e+03 1.727e+03 3.782e+03, threshold=2.679e+03, percent-clipped=7.0 2023-06-27 12:25:13,292 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=1806312.0, ans=0.0 2023-06-27 12:25:33,659 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=6.84 vs. limit=15.0 2023-06-27 12:25:41,231 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=1806432.0, ans=0.0 2023-06-27 12:26:38,849 INFO [train.py:996] (3/4) Epoch 10, batch 26650, loss[loss=0.2173, simple_loss=0.3043, pruned_loss=0.06517, over 19953.00 frames. ], tot_loss[loss=0.2117, simple_loss=0.2907, pruned_loss=0.06636, over 4245555.22 frames. ], batch size: 702, lr: 2.88e-03, grad_scale: 16.0 2023-06-27 12:26:55,284 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=1806612.0, ans=0.0 2023-06-27 12:28:11,009 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=1806852.0, ans=0.2 2023-06-27 12:28:18,262 INFO [train.py:996] (3/4) Epoch 10, batch 26700, loss[loss=0.1895, simple_loss=0.2669, pruned_loss=0.05604, over 21858.00 frames. ], tot_loss[loss=0.2048, simple_loss=0.2833, pruned_loss=0.06313, over 4258029.37 frames. ], batch size: 282, lr: 2.88e-03, grad_scale: 16.0 2023-06-27 12:28:23,319 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.270e+02 4.740e+02 5.974e+02 7.751e+02 2.095e+03, threshold=1.195e+03, percent-clipped=0.0 2023-06-27 12:28:56,413 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=2.59 vs. limit=12.0 2023-06-27 12:28:58,475 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys.whitening_limit, batch_count=1807032.0, ans=6.0 2023-06-27 12:29:33,917 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1807092.0, ans=0.0 2023-06-27 12:30:03,644 INFO [train.py:996] (3/4) Epoch 10, batch 26750, loss[loss=0.2109, simple_loss=0.3001, pruned_loss=0.0608, over 21618.00 frames. ], tot_loss[loss=0.2052, simple_loss=0.2842, pruned_loss=0.06313, over 4270901.34 frames. ], batch size: 389, lr: 2.88e-03, grad_scale: 16.0 2023-06-27 12:30:10,668 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1807212.0, ans=0.125 2023-06-27 12:31:18,483 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=1807392.0, ans=0.2 2023-06-27 12:31:45,851 INFO [train.py:996] (3/4) Epoch 10, batch 26800, loss[loss=0.2232, simple_loss=0.3086, pruned_loss=0.06887, over 21508.00 frames. ], tot_loss[loss=0.2121, simple_loss=0.2913, pruned_loss=0.06647, over 4271856.91 frames. 
], batch size: 112, lr: 2.88e-03, grad_scale: 32.0 2023-06-27 12:31:51,190 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.485e+02 8.253e+02 1.353e+03 2.004e+03 3.922e+03, threshold=2.706e+03, percent-clipped=54.0 2023-06-27 12:32:34,294 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.49 vs. limit=6.0 2023-06-27 12:32:53,107 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=1807692.0, ans=0.2 2023-06-27 12:33:01,793 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.39 vs. limit=15.0 2023-06-27 12:33:15,614 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=8.81 vs. limit=15.0 2023-06-27 12:33:19,665 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1807752.0, ans=0.1 2023-06-27 12:33:27,231 INFO [train.py:996] (3/4) Epoch 10, batch 26850, loss[loss=0.2111, simple_loss=0.2794, pruned_loss=0.07136, over 20792.00 frames. ], tot_loss[loss=0.2147, simple_loss=0.2924, pruned_loss=0.06856, over 4273674.94 frames. ], batch size: 609, lr: 2.88e-03, grad_scale: 16.0 2023-06-27 12:34:50,088 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=6.41 vs. limit=12.0 2023-06-27 12:34:57,805 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1808052.0, ans=0.1 2023-06-27 12:34:57,807 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=1808052.0, ans=0.2 2023-06-27 12:35:07,077 INFO [train.py:996] (3/4) Epoch 10, batch 26900, loss[loss=0.1861, simple_loss=0.2439, pruned_loss=0.06413, over 21337.00 frames. ], tot_loss[loss=0.2102, simple_loss=0.285, pruned_loss=0.06771, over 4275915.29 frames. ], batch size: 177, lr: 2.88e-03, grad_scale: 16.0 2023-06-27 12:35:13,693 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.943e+02 6.437e+02 8.362e+02 1.264e+03 2.899e+03, threshold=1.672e+03, percent-clipped=1.0 2023-06-27 12:36:07,403 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.09 vs. limit=6.0 2023-06-27 12:36:12,163 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.51 vs. limit=15.0 2023-06-27 12:36:15,481 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.79 vs. 
limit=12.0 2023-06-27 12:36:25,863 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=1808292.0, ans=0.2 2023-06-27 12:36:36,948 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1808352.0, ans=0.125 2023-06-27 12:36:40,844 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys.whitening_limit, batch_count=1808352.0, ans=6.0 2023-06-27 12:36:42,110 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=1808352.0, ans=0.125 2023-06-27 12:36:46,475 INFO [train.py:996] (3/4) Epoch 10, batch 26950, loss[loss=0.1898, simple_loss=0.2508, pruned_loss=0.06442, over 21650.00 frames. ], tot_loss[loss=0.2108, simple_loss=0.285, pruned_loss=0.06828, over 4259204.52 frames. ], batch size: 333, lr: 2.88e-03, grad_scale: 16.0 2023-06-27 12:38:06,344 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1808592.0, ans=0.125 2023-06-27 12:38:12,104 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.46 vs. limit=12.0 2023-06-27 12:38:27,736 INFO [train.py:996] (3/4) Epoch 10, batch 27000, loss[loss=0.1811, simple_loss=0.2701, pruned_loss=0.04603, over 21700.00 frames. ], tot_loss[loss=0.2087, simple_loss=0.2852, pruned_loss=0.06608, over 4267364.56 frames. ], batch size: 332, lr: 2.88e-03, grad_scale: 8.0 2023-06-27 12:38:27,737 INFO [train.py:1019] (3/4) Computing validation loss 2023-06-27 12:38:47,563 INFO [train.py:1028] (3/4) Epoch 10, validation: loss=0.2474, simple_loss=0.3368, pruned_loss=0.07904, over 1796401.00 frames. 2023-06-27 12:38:47,564 INFO [train.py:1029] (3/4) Maximum memory allocated so far is 23690MB 2023-06-27 12:39:01,424 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.008e+02 5.840e+02 8.267e+02 1.216e+03 2.372e+03, threshold=1.653e+03, percent-clipped=7.0 2023-06-27 12:39:38,529 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1808832.0, ans=0.125 2023-06-27 12:39:41,686 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=1808832.0, ans=0.2 2023-06-27 12:39:59,162 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.23 vs. limit=6.0 2023-06-27 12:40:00,486 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.93 vs. limit=15.0 2023-06-27 12:40:11,353 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=1808952.0, ans=0.035 2023-06-27 12:40:29,858 INFO [train.py:996] (3/4) Epoch 10, batch 27050, loss[loss=0.2166, simple_loss=0.2935, pruned_loss=0.0698, over 21871.00 frames. ], tot_loss[loss=0.2088, simple_loss=0.2888, pruned_loss=0.0644, over 4269724.96 frames. 
], batch size: 351, lr: 2.88e-03, grad_scale: 8.0 2023-06-27 12:40:48,129 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1809012.0, ans=0.125 2023-06-27 12:40:49,673 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1809072.0, ans=0.0 2023-06-27 12:40:51,088 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1809072.0, ans=0.125 2023-06-27 12:41:36,301 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=1809192.0, ans=0.95 2023-06-27 12:42:10,019 INFO [train.py:996] (3/4) Epoch 10, batch 27100, loss[loss=0.2087, simple_loss=0.3111, pruned_loss=0.05313, over 21750.00 frames. ], tot_loss[loss=0.2091, simple_loss=0.2892, pruned_loss=0.06446, over 4276659.32 frames. ], batch size: 247, lr: 2.88e-03, grad_scale: 8.0 2023-06-27 12:42:22,709 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.015e+02 5.614e+02 8.365e+02 1.169e+03 2.643e+03, threshold=1.673e+03, percent-clipped=10.0 2023-06-27 12:43:21,330 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1809492.0, ans=0.125 2023-06-27 12:43:51,700 INFO [train.py:996] (3/4) Epoch 10, batch 27150, loss[loss=0.2674, simple_loss=0.3636, pruned_loss=0.08556, over 21827.00 frames. ], tot_loss[loss=0.2182, simple_loss=0.3007, pruned_loss=0.06791, over 4273861.59 frames. ], batch size: 371, lr: 2.88e-03, grad_scale: 8.0 2023-06-27 12:44:21,890 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=1809672.0, ans=0.0 2023-06-27 12:44:28,206 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1809672.0, ans=0.1 2023-06-27 12:44:29,664 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1809672.0, ans=0.1 2023-06-27 12:44:29,683 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=1809672.0, ans=0.125 2023-06-27 12:44:50,099 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1809792.0, ans=0.1 2023-06-27 12:45:05,298 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1809792.0, ans=0.1 2023-06-27 12:45:37,925 INFO [train.py:996] (3/4) Epoch 10, batch 27200, loss[loss=0.2445, simple_loss=0.3224, pruned_loss=0.0833, over 20684.00 frames. ], tot_loss[loss=0.2227, simple_loss=0.3065, pruned_loss=0.06951, over 4264000.95 frames. 
], batch size: 607, lr: 2.88e-03, grad_scale: 16.0 2023-06-27 12:45:43,719 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1809912.0, ans=0.1 2023-06-27 12:45:50,768 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.524e+02 6.689e+02 1.006e+03 1.593e+03 2.972e+03, threshold=2.013e+03, percent-clipped=22.0 2023-06-27 12:46:03,462 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1809972.0, ans=0.1 2023-06-27 12:46:14,870 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1810032.0, ans=0.125 2023-06-27 12:46:22,774 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1810032.0, ans=0.125 2023-06-27 12:47:09,151 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.12 vs. limit=6.0 2023-06-27 12:47:14,574 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-27 12:47:18,103 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1810212.0, ans=0.125 2023-06-27 12:47:18,999 INFO [train.py:996] (3/4) Epoch 10, batch 27250, loss[loss=0.2599, simple_loss=0.3367, pruned_loss=0.09151, over 21735.00 frames. ], tot_loss[loss=0.2264, simple_loss=0.3077, pruned_loss=0.07251, over 4266096.59 frames. ], batch size: 124, lr: 2.88e-03, grad_scale: 16.0 2023-06-27 12:47:19,654 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1810212.0, ans=0.0 2023-06-27 12:47:33,435 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1810212.0, ans=0.125 2023-06-27 12:47:35,831 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.83 vs. limit=10.0 2023-06-27 12:48:05,893 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-27 12:48:30,347 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-27 12:48:45,410 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1810452.0, ans=0.125 2023-06-27 12:48:50,160 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=1810452.0, ans=0.95 2023-06-27 12:48:58,233 INFO [train.py:996] (3/4) Epoch 10, batch 27300, loss[loss=0.2371, simple_loss=0.3157, pruned_loss=0.07925, over 21334.00 frames. ], tot_loss[loss=0.2296, simple_loss=0.3105, pruned_loss=0.07434, over 4265856.44 frames. ], batch size: 159, lr: 2.88e-03, grad_scale: 16.0 2023-06-27 12:49:06,653 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.182e+02 6.508e+02 9.291e+02 1.314e+03 3.410e+03, threshold=1.858e+03, percent-clipped=10.0 2023-06-27 12:49:21,229 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.08 vs. 
limit=6.0 2023-06-27 12:49:21,393 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=7.31 vs. limit=15.0 2023-06-27 12:49:39,220 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1810632.0, ans=0.125 2023-06-27 12:50:22,740 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1810752.0, ans=0.125 2023-06-27 12:50:38,160 INFO [train.py:996] (3/4) Epoch 10, batch 27350, loss[loss=0.2073, simple_loss=0.2881, pruned_loss=0.06329, over 21449.00 frames. ], tot_loss[loss=0.2315, simple_loss=0.3135, pruned_loss=0.07478, over 4267265.91 frames. ], batch size: 194, lr: 2.88e-03, grad_scale: 16.0 2023-06-27 12:51:36,252 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1810992.0, ans=0.0 2023-06-27 12:51:42,360 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1810992.0, ans=0.125 2023-06-27 12:52:00,221 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=1811052.0, ans=0.125 2023-06-27 12:52:00,712 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.07 vs. limit=15.0 2023-06-27 12:52:12,672 INFO [train.py:996] (3/4) Epoch 10, batch 27400, loss[loss=0.2092, simple_loss=0.2777, pruned_loss=0.07031, over 21541.00 frames. ], tot_loss[loss=0.228, simple_loss=0.3081, pruned_loss=0.07393, over 4272586.66 frames. ], batch size: 548, lr: 2.88e-03, grad_scale: 16.0 2023-06-27 12:52:20,978 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.204e+02 5.726e+02 8.020e+02 1.365e+03 2.836e+03, threshold=1.604e+03, percent-clipped=8.0 2023-06-27 12:52:52,824 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-27 12:53:07,494 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=1811232.0, ans=0.125 2023-06-27 12:53:25,186 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=1811292.0, ans=0.125 2023-06-27 12:53:27,043 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=1811292.0, ans=0.125 2023-06-27 12:53:33,636 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=1811292.0, ans=0.125 2023-06-27 12:53:45,355 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-27 12:53:54,326 INFO [train.py:996] (3/4) Epoch 10, batch 27450, loss[loss=0.2099, simple_loss=0.291, pruned_loss=0.06443, over 21421.00 frames. ], tot_loss[loss=0.2239, simple_loss=0.3023, pruned_loss=0.07276, over 4281781.67 frames. ], batch size: 211, lr: 2.88e-03, grad_scale: 16.0 2023-06-27 12:53:56,517 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1811412.0, ans=0.1 2023-06-27 12:54:39,476 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=4.25 vs. 
limit=15.0 2023-06-27 12:54:52,968 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=1811532.0, ans=0.2 2023-06-27 12:55:01,595 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1811592.0, ans=0.0 2023-06-27 12:55:30,300 INFO [train.py:996] (3/4) Epoch 10, batch 27500, loss[loss=0.2336, simple_loss=0.3195, pruned_loss=0.07386, over 21848.00 frames. ], tot_loss[loss=0.2246, simple_loss=0.3022, pruned_loss=0.07347, over 4279358.53 frames. ], batch size: 107, lr: 2.88e-03, grad_scale: 16.0 2023-06-27 12:55:30,946 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1811712.0, ans=0.125 2023-06-27 12:55:38,235 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.956e+02 6.120e+02 9.251e+02 1.541e+03 3.924e+03, threshold=1.850e+03, percent-clipped=23.0 2023-06-27 12:56:08,076 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.03 vs. limit=6.0 2023-06-27 12:56:36,202 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1811892.0, ans=0.125 2023-06-27 12:57:09,569 INFO [train.py:996] (3/4) Epoch 10, batch 27550, loss[loss=0.1981, simple_loss=0.2704, pruned_loss=0.06294, over 21482.00 frames. ], tot_loss[loss=0.2191, simple_loss=0.2966, pruned_loss=0.07079, over 4285092.78 frames. ], batch size: 389, lr: 2.87e-03, grad_scale: 16.0 2023-06-27 12:57:16,632 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1812012.0, ans=0.1 2023-06-27 12:57:21,323 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=1812012.0, ans=0.2 2023-06-27 12:57:57,972 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1812132.0, ans=0.125 2023-06-27 12:58:13,571 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=1812132.0, ans=0.035 2023-06-27 12:58:14,220 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.02 vs. limit=22.5 2023-06-27 12:58:18,347 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=1812192.0, ans=0.125 2023-06-27 12:58:23,160 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1812192.0, ans=0.125 2023-06-27 12:58:28,036 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=1812192.0, ans=0.2 2023-06-27 12:58:48,791 INFO [train.py:996] (3/4) Epoch 10, batch 27600, loss[loss=0.2453, simple_loss=0.2871, pruned_loss=0.1018, over 21341.00 frames. ], tot_loss[loss=0.2139, simple_loss=0.2892, pruned_loss=0.06928, over 4287754.49 frames. 
], batch size: 508, lr: 2.87e-03, grad_scale: 32.0 2023-06-27 12:58:51,104 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1812312.0, ans=0.0 2023-06-27 12:58:56,895 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.403e+02 6.402e+02 9.119e+02 1.240e+03 2.150e+03, threshold=1.824e+03, percent-clipped=4.0 2023-06-27 12:59:06,736 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=1812312.0, ans=0.125 2023-06-27 13:00:05,341 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1812492.0, ans=0.125 2023-06-27 13:00:08,331 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1812492.0, ans=0.0 2023-06-27 13:00:11,840 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1812552.0, ans=0.1 2023-06-27 13:00:18,795 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=1812552.0, ans=0.0 2023-06-27 13:00:29,606 INFO [train.py:996] (3/4) Epoch 10, batch 27650, loss[loss=0.2087, simple_loss=0.3037, pruned_loss=0.05686, over 21714.00 frames. ], tot_loss[loss=0.2106, simple_loss=0.2842, pruned_loss=0.06848, over 4280977.92 frames. ], batch size: 298, lr: 2.87e-03, grad_scale: 32.0 2023-06-27 13:01:00,714 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-27 13:01:10,236 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer_ff3.min_abs, batch_count=1812732.0, ans=0.2 2023-06-27 13:01:41,364 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1812792.0, ans=0.125 2023-06-27 13:01:45,359 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten.whitening_limit, batch_count=1812792.0, ans=15.0 2023-06-27 13:02:10,578 INFO [train.py:996] (3/4) Epoch 10, batch 27700, loss[loss=0.2395, simple_loss=0.3319, pruned_loss=0.07352, over 21269.00 frames. ], tot_loss[loss=0.209, simple_loss=0.2839, pruned_loss=0.06705, over 4282562.39 frames. 
], batch size: 548, lr: 2.87e-03, grad_scale: 32.0 2023-06-27 13:02:12,768 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=1812912.0, ans=0.125 2023-06-27 13:02:23,379 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.384e+02 6.863e+02 9.869e+02 1.519e+03 3.382e+03, threshold=1.974e+03, percent-clipped=13.0 2023-06-27 13:02:25,560 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=1812912.0, ans=0.2 2023-06-27 13:03:27,730 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1813092.0, ans=0.125 2023-06-27 13:03:44,394 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1813152.0, ans=0.125 2023-06-27 13:03:49,241 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1813212.0, ans=0.125 2023-06-27 13:03:50,223 INFO [train.py:996] (3/4) Epoch 10, batch 27750, loss[loss=0.2121, simple_loss=0.2899, pruned_loss=0.06714, over 21245.00 frames. ], tot_loss[loss=0.2093, simple_loss=0.2857, pruned_loss=0.06644, over 4279031.86 frames. ], batch size: 176, lr: 2.87e-03, grad_scale: 16.0 2023-06-27 13:03:55,880 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1813212.0, ans=0.1 2023-06-27 13:04:47,360 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1813332.0, ans=0.125 2023-06-27 13:05:24,989 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.53 vs. limit=6.0 2023-06-27 13:05:28,653 INFO [train.py:996] (3/4) Epoch 10, batch 27800, loss[loss=0.2359, simple_loss=0.3002, pruned_loss=0.08575, over 21650.00 frames. ], tot_loss[loss=0.2101, simple_loss=0.2858, pruned_loss=0.06722, over 4286643.43 frames. ], batch size: 471, lr: 2.87e-03, grad_scale: 16.0 2023-06-27 13:05:43,192 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.682e+02 6.752e+02 9.329e+02 1.344e+03 2.939e+03, threshold=1.866e+03, percent-clipped=10.0 2023-06-27 13:06:40,913 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=1813692.0, ans=0.125 2023-06-27 13:06:42,613 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1813692.0, ans=0.125 2023-06-27 13:06:47,711 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1813692.0, ans=0.1 2023-06-27 13:07:09,261 INFO [train.py:996] (3/4) Epoch 10, batch 27850, loss[loss=0.2145, simple_loss=0.2774, pruned_loss=0.07582, over 21617.00 frames. ], tot_loss[loss=0.2101, simple_loss=0.2846, pruned_loss=0.0678, over 4290419.51 frames. 
], batch size: 548, lr: 2.87e-03, grad_scale: 16.0 2023-06-27 13:08:12,282 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=1813932.0, ans=0.0 2023-06-27 13:08:12,326 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1813932.0, ans=0.1 2023-06-27 13:08:20,911 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=1813992.0, ans=0.2 2023-06-27 13:08:25,889 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=1813992.0, ans=0.04949747468305833 2023-06-27 13:08:28,117 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=9.80 vs. limit=22.5 2023-06-27 13:09:01,132 INFO [train.py:996] (3/4) Epoch 10, batch 27900, loss[loss=0.2238, simple_loss=0.3197, pruned_loss=0.06399, over 21630.00 frames. ], tot_loss[loss=0.2167, simple_loss=0.2957, pruned_loss=0.06891, over 4293048.77 frames. ], batch size: 263, lr: 2.87e-03, grad_scale: 16.0 2023-06-27 13:09:01,905 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1814112.0, ans=0.125 2023-06-27 13:09:15,847 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.509e+02 6.352e+02 8.865e+02 1.400e+03 2.806e+03, threshold=1.773e+03, percent-clipped=7.0 2023-06-27 13:09:28,286 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=1814172.0, ans=0.2 2023-06-27 13:09:33,481 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1814172.0, ans=0.1 2023-06-27 13:09:57,708 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1814232.0, ans=0.125 2023-06-27 13:09:57,756 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=1814232.0, ans=0.05 2023-06-27 13:10:31,396 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1814352.0, ans=0.125 2023-06-27 13:10:48,713 INFO [train.py:996] (3/4) Epoch 10, batch 27950, loss[loss=0.1898, simple_loss=0.2784, pruned_loss=0.05065, over 21543.00 frames. ], tot_loss[loss=0.213, simple_loss=0.2958, pruned_loss=0.06508, over 4289482.76 frames. ], batch size: 230, lr: 2.87e-03, grad_scale: 16.0 2023-06-27 13:10:51,832 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.71 vs. limit=15.0 2023-06-27 13:11:12,154 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=1814472.0, ans=0.2 2023-06-27 13:12:03,202 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1814592.0, ans=0.1 2023-06-27 13:12:28,084 INFO [train.py:996] (3/4) Epoch 10, batch 28000, loss[loss=0.2425, simple_loss=0.308, pruned_loss=0.08847, over 21658.00 frames. ], tot_loss[loss=0.2103, simple_loss=0.2938, pruned_loss=0.06336, over 4290216.51 frames. 
], batch size: 471, lr: 2.87e-03, grad_scale: 32.0 2023-06-27 13:12:40,247 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1814712.0, ans=0.125 2023-06-27 13:12:42,691 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.320e+02 5.982e+02 8.841e+02 1.274e+03 3.365e+03, threshold=1.768e+03, percent-clipped=7.0 2023-06-27 13:12:54,748 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1814772.0, ans=0.0 2023-06-27 13:13:10,144 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1814832.0, ans=0.125 2023-06-27 13:14:14,332 INFO [train.py:996] (3/4) Epoch 10, batch 28050, loss[loss=0.1842, simple_loss=0.2639, pruned_loss=0.05222, over 21668.00 frames. ], tot_loss[loss=0.2106, simple_loss=0.2913, pruned_loss=0.06498, over 4290077.91 frames. ], batch size: 263, lr: 2.87e-03, grad_scale: 32.0 2023-06-27 13:14:48,002 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=1815132.0, ans=0.2 2023-06-27 13:14:54,560 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-27 13:15:54,362 INFO [train.py:996] (3/4) Epoch 10, batch 28100, loss[loss=0.1987, simple_loss=0.2637, pruned_loss=0.06683, over 21576.00 frames. ], tot_loss[loss=0.2091, simple_loss=0.2883, pruned_loss=0.06496, over 4284992.23 frames. ], batch size: 247, lr: 2.87e-03, grad_scale: 16.0 2023-06-27 13:16:06,169 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.041e+02 5.968e+02 9.165e+02 1.416e+03 2.614e+03, threshold=1.833e+03, percent-clipped=9.0 2023-06-27 13:16:42,403 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=12.25 vs. limit=15.0 2023-06-27 13:16:59,565 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1815492.0, ans=0.0 2023-06-27 13:17:34,193 INFO [train.py:996] (3/4) Epoch 10, batch 28150, loss[loss=0.1854, simple_loss=0.2566, pruned_loss=0.05708, over 21761.00 frames. ], tot_loss[loss=0.2052, simple_loss=0.2809, pruned_loss=0.06477, over 4277011.42 frames. ], batch size: 317, lr: 2.87e-03, grad_scale: 16.0 2023-06-27 13:18:12,038 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=1815732.0, ans=0.2 2023-06-27 13:18:32,216 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=9.77 vs. limit=15.0 2023-06-27 13:18:59,081 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1815852.0, ans=0.0 2023-06-27 13:19:04,055 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1815852.0, ans=0.0 2023-06-27 13:19:14,722 INFO [train.py:996] (3/4) Epoch 10, batch 28200, loss[loss=0.2694, simple_loss=0.3224, pruned_loss=0.1082, over 21366.00 frames. ], tot_loss[loss=0.2072, simple_loss=0.2814, pruned_loss=0.06653, over 4278239.97 frames. 
], batch size: 471, lr: 2.87e-03, grad_scale: 16.0 2023-06-27 13:19:26,339 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.905e+02 6.047e+02 9.821e+02 1.464e+03 4.986e+03, threshold=1.964e+03, percent-clipped=9.0 2023-06-27 13:20:16,637 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=1816092.0, ans=0.0 2023-06-27 13:20:21,444 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=1816092.0, ans=0.125 2023-06-27 13:20:32,585 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=1816092.0, ans=0.0 2023-06-27 13:20:54,969 INFO [train.py:996] (3/4) Epoch 10, batch 28250, loss[loss=0.1855, simple_loss=0.2448, pruned_loss=0.06314, over 20799.00 frames. ], tot_loss[loss=0.212, simple_loss=0.2858, pruned_loss=0.06907, over 4273997.49 frames. ], batch size: 609, lr: 2.87e-03, grad_scale: 16.0 2023-06-27 13:21:05,220 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=1816212.0, ans=0.0 2023-06-27 13:21:13,595 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=1816272.0, ans=0.0 2023-06-27 13:21:18,181 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=1816272.0, ans=0.125 2023-06-27 13:22:36,307 INFO [train.py:996] (3/4) Epoch 10, batch 28300, loss[loss=0.1977, simple_loss=0.2862, pruned_loss=0.05462, over 21627.00 frames. ], tot_loss[loss=0.2092, simple_loss=0.2848, pruned_loss=0.06683, over 4258083.88 frames. ], batch size: 414, lr: 2.87e-03, grad_scale: 16.0 2023-06-27 13:22:40,364 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=1816512.0, ans=0.125 2023-06-27 13:22:47,921 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.032e+02 5.786e+02 9.744e+02 1.588e+03 3.149e+03, threshold=1.949e+03, percent-clipped=13.0 2023-06-27 13:22:49,273 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=5.21 vs. limit=15.0 2023-06-27 13:23:15,216 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-27 13:23:45,358 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=1816692.0, ans=0.0 2023-06-27 13:24:15,592 INFO [train.py:996] (3/4) Epoch 10, batch 28350, loss[loss=0.1691, simple_loss=0.2427, pruned_loss=0.04779, over 21485.00 frames. ], tot_loss[loss=0.2026, simple_loss=0.2811, pruned_loss=0.06204, over 4239828.00 frames. ], batch size: 230, lr: 2.87e-03, grad_scale: 16.0 2023-06-27 13:24:46,766 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.27 vs. 
limit=15.0 2023-06-27 13:24:49,439 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1816872.0, ans=0.125 2023-06-27 13:25:00,548 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1816932.0, ans=0.1 2023-06-27 13:25:50,296 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=1817052.0, ans=0.025 2023-06-27 13:25:55,977 INFO [train.py:996] (3/4) Epoch 10, batch 28400, loss[loss=0.2022, simple_loss=0.275, pruned_loss=0.06467, over 21395.00 frames. ], tot_loss[loss=0.2011, simple_loss=0.2776, pruned_loss=0.06229, over 4245026.05 frames. ], batch size: 211, lr: 2.87e-03, grad_scale: 32.0 2023-06-27 13:26:18,381 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.150e+02 6.326e+02 1.038e+03 1.651e+03 3.367e+03, threshold=2.075e+03, percent-clipped=16.0 2023-06-27 13:26:19,004 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=1817112.0, ans=0.09899494936611666 2023-06-27 13:26:20,931 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1817172.0, ans=0.1 2023-06-27 13:26:52,841 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1817232.0, ans=0.125 2023-06-27 13:27:28,145 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1817352.0, ans=0.1 2023-06-27 13:27:31,256 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=1817352.0, ans=0.2 2023-06-27 13:27:37,249 INFO [train.py:996] (3/4) Epoch 10, batch 28450, loss[loss=0.2161, simple_loss=0.2941, pruned_loss=0.0691, over 21862.00 frames. ], tot_loss[loss=0.2085, simple_loss=0.2839, pruned_loss=0.0665, over 4259385.43 frames. ], batch size: 351, lr: 2.87e-03, grad_scale: 16.0 2023-06-27 13:28:12,929 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=11.96 vs. limit=15.0 2023-06-27 13:28:33,447 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=1817532.0, ans=0.0 2023-06-27 13:28:52,005 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer_ff2.min_abs, batch_count=1817592.0, ans=0.1 2023-06-27 13:29:00,697 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1817652.0, ans=0.1 2023-06-27 13:29:27,865 INFO [train.py:996] (3/4) Epoch 10, batch 28500, loss[loss=0.2062, simple_loss=0.2795, pruned_loss=0.06647, over 20694.00 frames. ], tot_loss[loss=0.2118, simple_loss=0.2862, pruned_loss=0.06872, over 4266036.90 frames. 
], batch size: 607, lr: 2.87e-03, grad_scale: 16.0 2023-06-27 13:29:30,228 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1817712.0, ans=0.125 2023-06-27 13:29:30,270 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1817712.0, ans=0.0 2023-06-27 13:29:50,451 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.832e+02 6.822e+02 1.044e+03 1.325e+03 2.451e+03, threshold=2.088e+03, percent-clipped=2.0 2023-06-27 13:29:56,217 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=1817772.0, ans=0.07 2023-06-27 13:30:34,193 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1817892.0, ans=0.0 2023-06-27 13:30:42,542 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1817952.0, ans=0.0 2023-06-27 13:31:05,481 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-27 13:31:14,249 INFO [train.py:996] (3/4) Epoch 10, batch 28550, loss[loss=0.2494, simple_loss=0.3517, pruned_loss=0.07353, over 21786.00 frames. ], tot_loss[loss=0.2189, simple_loss=0.2952, pruned_loss=0.07133, over 4275039.69 frames. ], batch size: 282, lr: 2.87e-03, grad_scale: 16.0 2023-06-27 13:32:59,168 INFO [train.py:996] (3/4) Epoch 10, batch 28600, loss[loss=0.2218, simple_loss=0.3079, pruned_loss=0.06782, over 21768.00 frames. ], tot_loss[loss=0.2241, simple_loss=0.3019, pruned_loss=0.07314, over 4279282.49 frames. ], batch size: 118, lr: 2.87e-03, grad_scale: 16.0 2023-06-27 13:33:12,232 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.169e+02 6.322e+02 9.283e+02 1.275e+03 2.692e+03, threshold=1.857e+03, percent-clipped=3.0 2023-06-27 13:34:05,891 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.68 vs. limit=6.0 2023-06-27 13:34:35,535 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1818552.0, ans=0.125 2023-06-27 13:34:39,246 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-27 13:34:40,149 INFO [train.py:996] (3/4) Epoch 10, batch 28650, loss[loss=0.1846, simple_loss=0.2545, pruned_loss=0.05738, over 21542.00 frames. ], tot_loss[loss=0.2191, simple_loss=0.2948, pruned_loss=0.07171, over 4284039.61 frames. 
], batch size: 263, lr: 2.87e-03, grad_scale: 16.0 2023-06-27 13:35:40,048 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1818792.0, ans=0.125 2023-06-27 13:36:04,123 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1818852.0, ans=0.125 2023-06-27 13:36:10,983 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten.whitening_limit, batch_count=1818852.0, ans=22.5 2023-06-27 13:36:15,690 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=1818912.0, ans=0.0 2023-06-27 13:36:16,628 INFO [train.py:996] (3/4) Epoch 10, batch 28700, loss[loss=0.2274, simple_loss=0.3002, pruned_loss=0.07725, over 21871.00 frames. ], tot_loss[loss=0.2201, simple_loss=0.2943, pruned_loss=0.07297, over 4277354.16 frames. ], batch size: 371, lr: 2.87e-03, grad_scale: 16.0 2023-06-27 13:36:22,294 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=1818912.0, ans=0.125 2023-06-27 13:36:29,763 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.086e+02 6.900e+02 1.037e+03 1.524e+03 3.185e+03, threshold=2.075e+03, percent-clipped=14.0 2023-06-27 13:37:29,167 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.24 vs. limit=15.0 2023-06-27 13:37:33,494 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=1819092.0, ans=0.2 2023-06-27 13:37:57,656 INFO [train.py:996] (3/4) Epoch 10, batch 28750, loss[loss=0.206, simple_loss=0.2828, pruned_loss=0.0646, over 21478.00 frames. ], tot_loss[loss=0.2207, simple_loss=0.2947, pruned_loss=0.07333, over 4275453.23 frames. ], batch size: 211, lr: 2.87e-03, grad_scale: 16.0 2023-06-27 13:38:01,487 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1819212.0, ans=0.125 2023-06-27 13:38:06,581 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1819212.0, ans=0.125 2023-06-27 13:39:14,850 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=1819392.0, ans=10.0 2023-06-27 13:39:16,153 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1819452.0, ans=0.125 2023-06-27 13:39:24,061 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1819452.0, ans=0.125 2023-06-27 13:39:33,352 INFO [train.py:996] (3/4) Epoch 10, batch 28800, loss[loss=0.2271, simple_loss=0.3058, pruned_loss=0.07417, over 21773.00 frames. ], tot_loss[loss=0.2215, simple_loss=0.2971, pruned_loss=0.07291, over 4285568.71 frames. ], batch size: 332, lr: 2.87e-03, grad_scale: 32.0 2023-06-27 13:39:47,037 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.078e+02 7.759e+02 9.840e+02 1.249e+03 3.010e+03, threshold=1.968e+03, percent-clipped=7.0 2023-06-27 13:41:09,920 INFO [train.py:996] (3/4) Epoch 10, batch 28850, loss[loss=0.2086, simple_loss=0.2806, pruned_loss=0.06829, over 21860.00 frames. 
], tot_loss[loss=0.2232, simple_loss=0.2983, pruned_loss=0.0741, over 4286082.81 frames. ], batch size: 332, lr: 2.87e-03, grad_scale: 16.0 2023-06-27 13:42:48,430 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.75 vs. limit=15.0 2023-06-27 13:42:50,433 INFO [train.py:996] (3/4) Epoch 10, batch 28900, loss[loss=0.2331, simple_loss=0.2996, pruned_loss=0.08328, over 21353.00 frames. ], tot_loss[loss=0.225, simple_loss=0.2999, pruned_loss=0.07501, over 4287072.49 frames. ], batch size: 159, lr: 2.87e-03, grad_scale: 16.0 2023-06-27 13:42:51,214 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=1820112.0, ans=0.025 2023-06-27 13:42:57,652 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-27 13:43:04,477 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1820112.0, ans=0.125 2023-06-27 13:43:05,419 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.892e+02 6.958e+02 1.036e+03 1.416e+03 3.093e+03, threshold=2.073e+03, percent-clipped=9.0 2023-06-27 13:43:05,975 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=1820172.0, ans=0.2 2023-06-27 13:43:49,618 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1820232.0, ans=0.125 2023-06-27 13:44:33,631 INFO [train.py:996] (3/4) Epoch 10, batch 28950, loss[loss=0.2725, simple_loss=0.3902, pruned_loss=0.07741, over 19776.00 frames. ], tot_loss[loss=0.2242, simple_loss=0.2999, pruned_loss=0.07425, over 4282222.36 frames. ], batch size: 703, lr: 2.87e-03, grad_scale: 16.0 2023-06-27 13:44:37,215 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=1820412.0, ans=0.0 2023-06-27 13:45:12,085 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.73 vs. limit=22.5 2023-06-27 13:45:30,925 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=3.96 vs. limit=15.0 2023-06-27 13:45:34,851 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1820532.0, ans=0.125 2023-06-27 13:46:15,129 INFO [train.py:996] (3/4) Epoch 10, batch 29000, loss[loss=0.2434, simple_loss=0.3218, pruned_loss=0.08255, over 21288.00 frames. ], tot_loss[loss=0.2243, simple_loss=0.3028, pruned_loss=0.07285, over 4277285.10 frames. ], batch size: 143, lr: 2.87e-03, grad_scale: 16.0 2023-06-27 13:46:43,665 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.952e+02 6.978e+02 9.216e+02 1.338e+03 4.286e+03, threshold=1.843e+03, percent-clipped=9.0 2023-06-27 13:46:58,982 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=1820772.0, ans=0.2 2023-06-27 13:47:14,637 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=2.87 vs. 
limit=12.0 2023-06-27 13:48:04,736 INFO [train.py:996] (3/4) Epoch 10, batch 29050, loss[loss=0.2065, simple_loss=0.2805, pruned_loss=0.06629, over 21855.00 frames. ], tot_loss[loss=0.2238, simple_loss=0.3011, pruned_loss=0.07322, over 4284042.76 frames. ], batch size: 371, lr: 2.87e-03, grad_scale: 16.0 2023-06-27 13:49:13,662 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=1821252.0, ans=0.125 2023-06-27 13:49:17,029 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=1821252.0, ans=0.04949747468305833 2023-06-27 13:49:21,843 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=1821252.0, ans=0.2 2023-06-27 13:49:31,578 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1821252.0, ans=0.125 2023-06-27 13:49:39,633 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1821312.0, ans=0.0 2023-06-27 13:49:40,670 INFO [train.py:996] (3/4) Epoch 10, batch 29100, loss[loss=0.1688, simple_loss=0.2279, pruned_loss=0.05481, over 21205.00 frames. ], tot_loss[loss=0.2175, simple_loss=0.2928, pruned_loss=0.07108, over 4275806.05 frames. ], batch size: 548, lr: 2.87e-03, grad_scale: 16.0 2023-06-27 13:49:55,606 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.400e+02 6.043e+02 9.332e+02 1.585e+03 3.722e+03, threshold=1.866e+03, percent-clipped=13.0 2023-06-27 13:50:29,090 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=1821492.0, ans=0.125 2023-06-27 13:50:37,241 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1821492.0, ans=0.1 2023-06-27 13:50:57,718 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=1821552.0, ans=0.2 2023-06-27 13:51:16,637 INFO [train.py:996] (3/4) Epoch 10, batch 29150, loss[loss=0.2354, simple_loss=0.3378, pruned_loss=0.06656, over 21214.00 frames. ], tot_loss[loss=0.2164, simple_loss=0.2926, pruned_loss=0.07003, over 4279537.12 frames. 
], batch size: 548, lr: 2.87e-03, grad_scale: 16.0 2023-06-27 13:51:22,698 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-27 13:51:22,815 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1821612.0, ans=0.125 2023-06-27 13:51:32,321 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-27 13:51:33,824 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=1821672.0, ans=0.5 2023-06-27 13:51:56,946 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1821732.0, ans=0.125 2023-06-27 13:52:01,768 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1821732.0, ans=0.0 2023-06-27 13:52:20,694 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.min_positive, batch_count=1821792.0, ans=0.05 2023-06-27 13:52:26,946 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=1821852.0, ans=0.0 2023-06-27 13:52:57,672 INFO [train.py:996] (3/4) Epoch 10, batch 29200, loss[loss=0.184, simple_loss=0.2572, pruned_loss=0.05538, over 21718.00 frames. ], tot_loss[loss=0.214, simple_loss=0.2894, pruned_loss=0.06928, over 4269717.24 frames. ], batch size: 316, lr: 2.87e-03, grad_scale: 32.0 2023-06-27 13:52:58,330 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1821912.0, ans=0.1 2023-06-27 13:53:08,224 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=1821912.0, ans=0.125 2023-06-27 13:53:08,917 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.25 vs. limit=15.0 2023-06-27 13:53:14,036 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.182e+02 6.160e+02 1.002e+03 1.749e+03 3.498e+03, threshold=2.004e+03, percent-clipped=20.0 2023-06-27 13:53:29,327 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=1822032.0, ans=0.125 2023-06-27 13:53:29,339 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=1822032.0, ans=0.07 2023-06-27 13:53:46,162 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=7.50 vs. limit=15.0 2023-06-27 13:54:29,613 INFO [train.py:996] (3/4) Epoch 10, batch 29250, loss[loss=0.2047, simple_loss=0.2944, pruned_loss=0.05753, over 21637.00 frames. ], tot_loss[loss=0.2095, simple_loss=0.2864, pruned_loss=0.06628, over 4259088.41 frames. 
], batch size: 263, lr: 2.87e-03, grad_scale: 16.0 2023-06-27 13:54:32,203 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1822212.0, ans=0.125 2023-06-27 13:54:55,274 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1822272.0, ans=0.125 2023-06-27 13:54:57,215 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=2.70 vs. limit=12.0 2023-06-27 13:55:34,260 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=1822392.0, ans=0.2 2023-06-27 13:56:06,666 INFO [train.py:996] (3/4) Epoch 10, batch 29300, loss[loss=0.186, simple_loss=0.2566, pruned_loss=0.0577, over 21778.00 frames. ], tot_loss[loss=0.2097, simple_loss=0.2885, pruned_loss=0.06547, over 4263928.79 frames. ], batch size: 124, lr: 2.87e-03, grad_scale: 16.0 2023-06-27 13:56:13,814 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1822512.0, ans=0.1 2023-06-27 13:56:17,271 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=1822512.0, ans=0.2 2023-06-27 13:56:22,985 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.212e+02 5.568e+02 7.846e+02 1.257e+03 2.359e+03, threshold=1.569e+03, percent-clipped=3.0 2023-06-27 13:56:40,110 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=1822632.0, ans=0.025 2023-06-27 13:56:56,858 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1822692.0, ans=0.125 2023-06-27 13:57:07,866 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=1822692.0, ans=0.125 2023-06-27 13:57:48,082 INFO [train.py:996] (3/4) Epoch 10, batch 29350, loss[loss=0.1738, simple_loss=0.2401, pruned_loss=0.05373, over 21583.00 frames. ], tot_loss[loss=0.2072, simple_loss=0.2848, pruned_loss=0.06477, over 4270875.08 frames. ], batch size: 247, lr: 2.87e-03, grad_scale: 16.0 2023-06-27 13:57:58,704 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=1822812.0, ans=0.035 2023-06-27 13:58:24,001 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1822932.0, ans=0.1 2023-06-27 13:58:25,948 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.03 vs. limit=15.0 2023-06-27 13:59:06,616 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1822992.0, ans=0.125 2023-06-27 13:59:31,068 INFO [train.py:996] (3/4) Epoch 10, batch 29400, loss[loss=0.2577, simple_loss=0.3315, pruned_loss=0.09195, over 21473.00 frames. ], tot_loss[loss=0.2052, simple_loss=0.2836, pruned_loss=0.06343, over 4269178.39 frames. 
], batch size: 508, lr: 2.87e-03, grad_scale: 16.0 2023-06-27 13:59:47,673 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.061e+02 6.893e+02 1.012e+03 1.543e+03 3.903e+03, threshold=2.024e+03, percent-clipped=23.0 2023-06-27 14:00:25,932 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=13.77 vs. limit=15.0 2023-06-27 14:00:51,581 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=1823292.0, ans=0.04949747468305833 2023-06-27 14:01:12,569 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.69 vs. limit=22.5 2023-06-27 14:01:12,869 INFO [train.py:996] (3/4) Epoch 10, batch 29450, loss[loss=0.2414, simple_loss=0.3175, pruned_loss=0.08266, over 21369.00 frames. ], tot_loss[loss=0.205, simple_loss=0.2834, pruned_loss=0.06332, over 4270592.87 frames. ], batch size: 549, lr: 2.87e-03, grad_scale: 16.0 2023-06-27 14:01:19,972 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1823412.0, ans=0.0 2023-06-27 14:01:35,284 INFO [scaling.py:962] (3/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=3.79 vs. limit=5.0 2023-06-27 14:01:40,955 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1823472.0, ans=0.125 2023-06-27 14:02:51,892 INFO [train.py:996] (3/4) Epoch 10, batch 29500, loss[loss=0.2167, simple_loss=0.2946, pruned_loss=0.06942, over 21314.00 frames. ], tot_loss[loss=0.2099, simple_loss=0.2876, pruned_loss=0.06611, over 4271944.78 frames. ], batch size: 176, lr: 2.87e-03, grad_scale: 16.0 2023-06-27 14:03:07,700 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.631e+02 6.865e+02 1.061e+03 1.645e+03 3.419e+03, threshold=2.123e+03, percent-clipped=12.0 2023-06-27 14:03:21,739 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=1823772.0, ans=0.2 2023-06-27 14:03:57,875 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=1823892.0, ans=0.2 2023-06-27 14:04:07,177 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=1823892.0, ans=0.05 2023-06-27 14:04:33,170 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=6.39 vs. limit=15.0 2023-06-27 14:04:33,490 INFO [train.py:996] (3/4) Epoch 10, batch 29550, loss[loss=0.2123, simple_loss=0.2869, pruned_loss=0.06885, over 21847.00 frames. ], tot_loss[loss=0.2113, simple_loss=0.2872, pruned_loss=0.06766, over 4275002.07 frames. ], batch size: 332, lr: 2.87e-03, grad_scale: 16.0 2023-06-27 14:05:22,088 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=11.59 vs. limit=15.0 2023-06-27 14:06:00,283 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.86 vs. 
limit=10.0 2023-06-27 14:06:01,372 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=1824252.0, ans=0.035 2023-06-27 14:06:11,156 INFO [train.py:996] (3/4) Epoch 10, batch 29600, loss[loss=0.2968, simple_loss=0.4185, pruned_loss=0.08756, over 19839.00 frames. ], tot_loss[loss=0.2173, simple_loss=0.2947, pruned_loss=0.06997, over 4279099.93 frames. ], batch size: 702, lr: 2.87e-03, grad_scale: 32.0 2023-06-27 14:06:15,420 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer_na.min_abs, batch_count=1824312.0, ans=0.02 2023-06-27 14:06:29,597 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.301e+02 5.949e+02 7.426e+02 9.960e+02 2.480e+03, threshold=1.485e+03, percent-clipped=1.0 2023-06-27 14:06:51,963 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=7.97 vs. limit=15.0 2023-06-27 14:06:56,811 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=10.15 vs. limit=15.0 2023-06-27 14:07:32,426 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1824552.0, ans=0.1 2023-06-27 14:07:38,670 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=1824552.0, ans=0.125 2023-06-27 14:07:43,093 INFO [train.py:996] (3/4) Epoch 10, batch 29650, loss[loss=0.1678, simple_loss=0.2512, pruned_loss=0.0422, over 21772.00 frames. ], tot_loss[loss=0.2147, simple_loss=0.2939, pruned_loss=0.06773, over 4284015.63 frames. ], batch size: 316, lr: 2.86e-03, grad_scale: 16.0 2023-06-27 14:07:52,241 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=10.25 vs. limit=15.0 2023-06-27 14:08:59,091 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=1824792.0, ans=0.2 2023-06-27 14:09:20,391 INFO [train.py:996] (3/4) Epoch 10, batch 29700, loss[loss=0.1924, simple_loss=0.2612, pruned_loss=0.06179, over 21148.00 frames. ], tot_loss[loss=0.2141, simple_loss=0.2932, pruned_loss=0.06748, over 4287977.14 frames. 
], batch size: 608, lr: 2.86e-03, grad_scale: 16.0 2023-06-27 14:09:32,726 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=1824912.0, ans=0.125 2023-06-27 14:09:42,887 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.057e+02 7.297e+02 1.065e+03 1.869e+03 3.621e+03, threshold=2.131e+03, percent-clipped=32.0 2023-06-27 14:10:10,860 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-27 14:10:10,896 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=1825032.0, ans=0.0 2023-06-27 14:10:32,211 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1825092.0, ans=0.125 2023-06-27 14:10:38,798 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=1825152.0, ans=0.0 2023-06-27 14:10:45,072 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=1825152.0, ans=0.125 2023-06-27 14:10:56,003 INFO [train.py:996] (3/4) Epoch 10, batch 29750, loss[loss=0.2317, simple_loss=0.3471, pruned_loss=0.05817, over 20870.00 frames. ], tot_loss[loss=0.2173, simple_loss=0.2988, pruned_loss=0.06786, over 4285024.55 frames. ], batch size: 607, lr: 2.86e-03, grad_scale: 16.0 2023-06-27 14:11:28,610 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=1825272.0, ans=0.0 2023-06-27 14:11:29,986 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1825272.0, ans=0.0 2023-06-27 14:11:34,561 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1825272.0, ans=0.125 2023-06-27 14:11:43,729 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.98 vs. limit=15.0 2023-06-27 14:11:58,218 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1825392.0, ans=0.125 2023-06-27 14:12:14,730 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=1825452.0, ans=0.125 2023-06-27 14:12:27,395 INFO [train.py:996] (3/4) Epoch 10, batch 29800, loss[loss=0.2472, simple_loss=0.317, pruned_loss=0.08867, over 21840.00 frames. ], tot_loss[loss=0.2194, simple_loss=0.301, pruned_loss=0.06896, over 4294158.72 frames. ], batch size: 107, lr: 2.86e-03, grad_scale: 16.0 2023-06-27 14:12:49,004 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1825572.0, ans=0.125 2023-06-27 14:12:51,510 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.238e+02 6.246e+02 9.031e+02 1.363e+03 2.753e+03, threshold=1.806e+03, percent-clipped=5.0 2023-06-27 14:13:24,393 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.98 vs. 
limit=15.0 2023-06-27 14:13:51,393 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=1825812.0, ans=0.125 2023-06-27 14:13:52,272 INFO [train.py:996] (3/4) Epoch 10, batch 29850, loss[loss=0.2002, simple_loss=0.2807, pruned_loss=0.05982, over 21775.00 frames. ], tot_loss[loss=0.2153, simple_loss=0.2966, pruned_loss=0.06698, over 4295183.30 frames. ], batch size: 414, lr: 2.86e-03, grad_scale: 8.0 2023-06-27 14:14:18,852 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.84 vs. limit=15.0 2023-06-27 14:15:18,437 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1826052.0, ans=0.125 2023-06-27 14:15:18,492 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1826052.0, ans=0.125 2023-06-27 14:15:27,916 INFO [train.py:996] (3/4) Epoch 10, batch 29900, loss[loss=0.2559, simple_loss=0.3252, pruned_loss=0.09335, over 21817.00 frames. ], tot_loss[loss=0.2162, simple_loss=0.295, pruned_loss=0.0687, over 4304047.21 frames. ], batch size: 441, lr: 2.86e-03, grad_scale: 8.0 2023-06-27 14:15:46,018 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=1826112.0, ans=0.2 2023-06-27 14:15:46,404 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.38 vs. limit=15.0 2023-06-27 14:16:07,292 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.080e+02 5.707e+02 7.601e+02 1.173e+03 3.198e+03, threshold=1.520e+03, percent-clipped=6.0 2023-06-27 14:16:07,909 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=1826172.0, ans=0.0 2023-06-27 14:16:11,641 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=1826172.0, ans=0.5 2023-06-27 14:16:27,510 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=9.26 vs. limit=10.0 2023-06-27 14:16:52,302 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.72 vs. limit=15.0 2023-06-27 14:17:05,118 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1826352.0, ans=0.125 2023-06-27 14:17:11,027 INFO [train.py:996] (3/4) Epoch 10, batch 29950, loss[loss=0.2394, simple_loss=0.3165, pruned_loss=0.08115, over 21694.00 frames. ], tot_loss[loss=0.2199, simple_loss=0.2967, pruned_loss=0.07155, over 4298634.34 frames. ], batch size: 351, lr: 2.86e-03, grad_scale: 8.0 2023-06-27 14:17:19,908 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=1826412.0, ans=0.2 2023-06-27 14:17:52,301 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.21 vs. 
limit=15.0 2023-06-27 14:17:59,017 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=1826532.0, ans=0.125 2023-06-27 14:18:16,759 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1826592.0, ans=0.125 2023-06-27 14:19:05,008 INFO [train.py:996] (3/4) Epoch 10, batch 30000, loss[loss=0.1954, simple_loss=0.2908, pruned_loss=0.04999, over 21829.00 frames. ], tot_loss[loss=0.2216, simple_loss=0.2993, pruned_loss=0.07199, over 4297171.52 frames. ], batch size: 282, lr: 2.86e-03, grad_scale: 16.0 2023-06-27 14:19:05,009 INFO [train.py:1019] (3/4) Computing validation loss 2023-06-27 14:19:22,076 INFO [train.py:1028] (3/4) Epoch 10, validation: loss=0.2475, simple_loss=0.3412, pruned_loss=0.07692, over 1796401.00 frames. 2023-06-27 14:19:22,077 INFO [train.py:1029] (3/4) Maximum memory allocated so far is 23690MB 2023-06-27 14:19:37,428 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.49 vs. limit=22.5 2023-06-27 14:19:39,200 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1826772.0, ans=0.0 2023-06-27 14:19:43,675 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.141e+02 6.862e+02 9.553e+02 1.677e+03 3.481e+03, threshold=1.911e+03, percent-clipped=29.0 2023-06-27 14:20:00,227 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1826832.0, ans=0.125 2023-06-27 14:21:05,352 INFO [train.py:996] (3/4) Epoch 10, batch 30050, loss[loss=0.224, simple_loss=0.3205, pruned_loss=0.06375, over 21646.00 frames. ], tot_loss[loss=0.2224, simple_loss=0.3046, pruned_loss=0.07016, over 4300251.63 frames. ], batch size: 247, lr: 2.86e-03, grad_scale: 16.0 2023-06-27 14:21:38,650 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.77 vs. limit=15.0 2023-06-27 14:21:46,369 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1827132.0, ans=0.1 2023-06-27 14:22:01,289 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1827132.0, ans=0.125 2023-06-27 14:22:12,877 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1827192.0, ans=0.125 2023-06-27 14:22:34,914 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1827252.0, ans=0.125 2023-06-27 14:22:39,172 INFO [train.py:996] (3/4) Epoch 10, batch 30100, loss[loss=0.1972, simple_loss=0.2608, pruned_loss=0.06674, over 21958.00 frames. ], tot_loss[loss=0.2213, simple_loss=0.3032, pruned_loss=0.06972, over 4295935.49 frames. ], batch size: 119, lr: 2.86e-03, grad_scale: 16.0 2023-06-27 14:22:51,609 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=8.14 vs. 
limit=15.0 2023-06-27 14:22:58,224 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.865e+02 7.541e+02 1.187e+03 1.645e+03 3.691e+03, threshold=2.374e+03, percent-clipped=12.0 2023-06-27 14:23:01,453 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.80 vs. limit=6.0 2023-06-27 14:23:25,113 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=1827432.0, ans=0.5 2023-06-27 14:24:17,468 INFO [train.py:996] (3/4) Epoch 10, batch 30150, loss[loss=0.2381, simple_loss=0.3035, pruned_loss=0.08637, over 21909.00 frames. ], tot_loss[loss=0.2195, simple_loss=0.2981, pruned_loss=0.07047, over 4293332.92 frames. ], batch size: 372, lr: 2.86e-03, grad_scale: 16.0 2023-06-27 14:24:21,256 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=1827612.0, ans=0.035 2023-06-27 14:25:11,949 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1827732.0, ans=0.0 2023-06-27 14:25:15,744 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=1827732.0, ans=0.2 2023-06-27 14:25:41,603 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.min_abs, batch_count=1827792.0, ans=0.5 2023-06-27 14:25:41,663 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1827792.0, ans=0.125 2023-06-27 14:26:02,895 INFO [train.py:996] (3/4) Epoch 10, batch 30200, loss[loss=0.2402, simple_loss=0.329, pruned_loss=0.07568, over 21423.00 frames. ], tot_loss[loss=0.2198, simple_loss=0.2995, pruned_loss=0.06999, over 4284067.13 frames. ], batch size: 131, lr: 2.86e-03, grad_scale: 16.0 2023-06-27 14:26:28,478 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1827912.0, ans=0.125 2023-06-27 14:26:42,660 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.354e+02 6.809e+02 8.710e+02 1.204e+03 2.614e+03, threshold=1.742e+03, percent-clipped=2.0 2023-06-27 14:26:57,132 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1828032.0, ans=0.0 2023-06-27 14:27:06,755 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=9.94 vs. limit=22.5 2023-06-27 14:27:12,770 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1828092.0, ans=0.125 2023-06-27 14:28:02,214 INFO [train.py:996] (3/4) Epoch 10, batch 30250, loss[loss=0.2388, simple_loss=0.3467, pruned_loss=0.06544, over 21783.00 frames. ], tot_loss[loss=0.2247, simple_loss=0.3061, pruned_loss=0.07166, over 4278237.59 frames. ], batch size: 282, lr: 2.86e-03, grad_scale: 16.0 2023-06-27 14:29:02,794 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=1828392.0, ans=0.07 2023-06-27 14:29:19,663 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=11.92 vs. 
limit=15.0 2023-06-27 14:29:38,355 INFO [train.py:996] (3/4) Epoch 10, batch 30300, loss[loss=0.1898, simple_loss=0.2596, pruned_loss=0.05999, over 21614.00 frames. ], tot_loss[loss=0.2243, simple_loss=0.3049, pruned_loss=0.07186, over 4280824.84 frames. ], batch size: 298, lr: 2.86e-03, grad_scale: 16.0 2023-06-27 14:29:46,125 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten.whitening_limit, batch_count=1828512.0, ans=15.0 2023-06-27 14:29:52,926 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=1828512.0, ans=0.2 2023-06-27 14:30:03,861 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.241e+02 6.596e+02 9.409e+02 1.315e+03 2.834e+03, threshold=1.882e+03, percent-clipped=10.0 2023-06-27 14:30:08,507 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.15 vs. limit=15.0 2023-06-27 14:30:08,541 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.16 vs. limit=6.0 2023-06-27 14:30:15,314 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.31 vs. limit=22.5 2023-06-27 14:31:06,529 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1828752.0, ans=0.0 2023-06-27 14:31:08,611 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=12.39 vs. limit=22.5 2023-06-27 14:31:27,450 INFO [train.py:996] (3/4) Epoch 10, batch 30350, loss[loss=0.223, simple_loss=0.2973, pruned_loss=0.07434, over 21573.00 frames. ], tot_loss[loss=0.2253, simple_loss=0.3051, pruned_loss=0.07276, over 4276142.40 frames. ], batch size: 263, lr: 2.86e-03, grad_scale: 16.0 2023-06-27 14:31:45,311 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1828872.0, ans=0.1 2023-06-27 14:32:15,565 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-27 14:32:16,717 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1828992.0, ans=0.125 2023-06-27 14:32:25,124 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.96 vs. limit=15.0 2023-06-27 14:32:30,175 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=1829052.0, ans=0.09899494936611666 2023-06-27 14:32:37,794 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1829052.0, ans=0.125 2023-06-27 14:32:41,772 INFO [train.py:996] (3/4) Epoch 10, batch 30400, loss[loss=0.2118, simple_loss=0.2619, pruned_loss=0.0809, over 20187.00 frames. ], tot_loss[loss=0.2214, simple_loss=0.2989, pruned_loss=0.07201, over 4257823.15 frames. 
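
Each per-batch entry above reports three numbers: loss, simple_loss and pruned_loss. Throughout this section the logged loss is reproduced, to within rounding, by adding half of simple_loss to the full pruned_loss; for the epoch 10, batch 30400 entry just above, 0.5 * 0.2619 + 0.0809 = 0.21185 against the logged loss=0.2118. A minimal sketch of that check follows; the 0.5 weight is inferred from the logged numbers, not read out of train.py.

# Sketch: check that the logged "loss" is a weighted sum of the two pruned-RNN-T
# terms. The 0.5 weight on simple_loss is an assumption inferred from the log.

def combined_loss(simple_loss, pruned_loss, simple_scale=0.5):
    return simple_scale * simple_loss + pruned_loss

# Per-batch values from the epoch 10, batch 30400 entry above:
print(combined_loss(0.2619, 0.0809))    # 0.21185  (logged: loss=0.2118)
# The same relation holds for the running totals at that point:
print(combined_loss(0.2989, 0.07201))   # 0.22146  (logged: loss=0.2214)
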
], batch size: 702, lr: 2.86e-03, grad_scale: 32.0 2023-06-27 14:33:09,688 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.426e+02 7.954e+02 1.288e+03 1.926e+03 4.132e+03, threshold=2.577e+03, percent-clipped=26.0 2023-06-27 14:33:13,144 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=1829172.0, ans=0.125 2023-06-27 14:34:05,479 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=1829352.0, ans=0.0 2023-06-27 14:34:08,231 INFO [train.py:996] (3/4) Epoch 10, batch 30450, loss[loss=0.2655, simple_loss=0.3807, pruned_loss=0.07511, over 19872.00 frames. ], tot_loss[loss=0.2214, simple_loss=0.2998, pruned_loss=0.07148, over 4199333.44 frames. ], batch size: 702, lr: 2.86e-03, grad_scale: 16.0 2023-06-27 14:34:10,216 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=1829412.0, ans=0.125 2023-06-27 14:34:48,240 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-27 14:37:28,495 INFO [train.py:996] (3/4) Epoch 11, batch 0, loss[loss=0.1911, simple_loss=0.2604, pruned_loss=0.0609, over 21592.00 frames. ], tot_loss[loss=0.1911, simple_loss=0.2604, pruned_loss=0.0609, over 21592.00 frames. ], batch size: 247, lr: 2.72e-03, grad_scale: 32.0 2023-06-27 14:37:28,495 INFO [train.py:1019] (3/4) Computing validation loss 2023-06-27 14:37:44,725 INFO [train.py:1028] (3/4) Epoch 11, validation: loss=0.2445, simple_loss=0.3464, pruned_loss=0.07127, over 1796401.00 frames. 2023-06-27 14:37:44,726 INFO [train.py:1029] (3/4) Maximum memory allocated so far is 23690MB 2023-06-27 14:37:55,017 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1829676.0, ans=0.0 2023-06-27 14:38:23,070 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.704e+02 1.606e+03 2.605e+03 4.493e+03 1.142e+04, threshold=5.209e+03, percent-clipped=50.0 2023-06-27 14:38:46,504 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1829796.0, ans=0.125 2023-06-27 14:39:03,946 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=1829916.0, ans=0.125 2023-06-27 14:39:26,621 INFO [train.py:996] (3/4) Epoch 11, batch 50, loss[loss=0.249, simple_loss=0.367, pruned_loss=0.06555, over 21794.00 frames. ], tot_loss[loss=0.2152, simple_loss=0.2982, pruned_loss=0.06613, over 954113.43 frames. ], batch size: 282, lr: 2.72e-03, grad_scale: 16.0 2023-06-27 14:40:15,968 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.45 vs. limit=22.5 2023-06-27 14:40:40,632 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.56 vs. limit=12.0 2023-06-27 14:40:48,458 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1830216.0, ans=0.0 2023-06-27 14:40:54,836 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=1830216.0, ans=0.2 2023-06-27 14:41:08,862 INFO [train.py:996] (3/4) Epoch 11, batch 100, loss[loss=0.2407, simple_loss=0.3452, pruned_loss=0.0681, over 21734.00 frames. 
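
In every Clipping_scale line in this stretch the logged threshold equals Clipping_scale (2.0) times the middle of the five grad-norm quartiles, i.e. twice the median gradient norm. The entry just above, right after epoch 11 starts, has median 2.605e+03 and threshold 5.209e+03, with percent-clipped jumping to 50.0 because the recent norms shot up at the epoch boundary. A sketch of that bookkeeping follows; how the window of recent norms is collected is an assumption, not something the log shows.

import numpy as np

# Sketch: recompute the quartile/threshold statistics that optim.py logs.
# Assumption: the threshold is clipping_scale * median of a window of recent
# per-step gradient norms; the window size and contents are not in the log.

def clipping_stats(recent_norms, clipping_scale=2.0):
    norms = np.asarray(recent_norms, dtype=float)
    quartiles = np.quantile(norms, [0.0, 0.25, 0.5, 0.75, 1.0])
    threshold = clipping_scale * quartiles[2]
    percent_clipped = 100.0 * float((norms > threshold).mean())
    return quartiles, threshold, percent_clipped

# With the quartiles logged above (4.704e+02 ... 1.142e+04), the median is
# 2.605e+03, so threshold = 2.0 * 2.605e+03 = 5.210e+03, matching the logged
# threshold=5.209e+03 up to rounding of the printed median.
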
], tot_loss[loss=0.2245, simple_loss=0.3121, pruned_loss=0.06841, over 1696964.34 frames. ], batch size: 332, lr: 2.72e-03, grad_scale: 16.0 2023-06-27 14:41:19,338 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1830276.0, ans=0.0 2023-06-27 14:41:46,182 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.133e+02 5.871e+02 7.705e+02 1.160e+03 1.899e+03, threshold=1.541e+03, percent-clipped=0.0 2023-06-27 14:41:48,567 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1830396.0, ans=0.125 2023-06-27 14:41:48,572 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1830396.0, ans=0.1 2023-06-27 14:42:20,269 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.06 vs. limit=15.0 2023-06-27 14:42:51,591 INFO [train.py:996] (3/4) Epoch 11, batch 150, loss[loss=0.2079, simple_loss=0.2994, pruned_loss=0.05821, over 21212.00 frames. ], tot_loss[loss=0.2263, simple_loss=0.3147, pruned_loss=0.06896, over 2266763.31 frames. ], batch size: 176, lr: 2.72e-03, grad_scale: 16.0 2023-06-27 14:43:59,469 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1830756.0, ans=0.125 2023-06-27 14:44:06,041 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1830756.0, ans=0.125 2023-06-27 14:44:18,776 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=1830816.0, ans=0.07 2023-06-27 14:44:33,946 INFO [train.py:996] (3/4) Epoch 11, batch 200, loss[loss=0.21, simple_loss=0.2845, pruned_loss=0.06771, over 21824.00 frames. ], tot_loss[loss=0.2248, simple_loss=0.3115, pruned_loss=0.06907, over 2711565.48 frames. ], batch size: 124, lr: 2.72e-03, grad_scale: 16.0 2023-06-27 14:44:42,667 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1830876.0, ans=0.0 2023-06-27 14:44:55,073 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=15.19 vs. limit=22.5 2023-06-27 14:44:59,737 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=8.46 vs. limit=15.0 2023-06-27 14:45:11,942 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.129e+02 7.270e+02 1.005e+03 1.466e+03 4.683e+03, threshold=2.010e+03, percent-clipped=22.0 2023-06-27 14:45:19,175 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1830996.0, ans=0.125 2023-06-27 14:45:54,473 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1831116.0, ans=0.125 2023-06-27 14:46:04,658 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1831116.0, ans=0.125 2023-06-27 14:46:18,429 INFO [train.py:996] (3/4) Epoch 11, batch 250, loss[loss=0.2076, simple_loss=0.2909, pruned_loss=0.0621, over 20770.00 frames. 
], tot_loss[loss=0.2232, simple_loss=0.3096, pruned_loss=0.06836, over 3044595.66 frames. ], batch size: 608, lr: 2.72e-03, grad_scale: 16.0 2023-06-27 14:46:41,766 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1831236.0, ans=0.0 2023-06-27 14:46:42,304 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.32 vs. limit=22.5 2023-06-27 14:47:33,912 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.59 vs. limit=15.0 2023-06-27 14:48:01,889 INFO [train.py:996] (3/4) Epoch 11, batch 300, loss[loss=0.2058, simple_loss=0.2759, pruned_loss=0.06788, over 21663.00 frames. ], tot_loss[loss=0.2222, simple_loss=0.3053, pruned_loss=0.06957, over 3321632.77 frames. ], batch size: 230, lr: 2.72e-03, grad_scale: 16.0 2023-06-27 14:48:05,999 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=1831476.0, ans=0.0 2023-06-27 14:48:16,135 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=1831476.0, ans=0.2 2023-06-27 14:48:40,911 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.322e+02 6.333e+02 9.156e+02 1.285e+03 2.394e+03, threshold=1.831e+03, percent-clipped=6.0 2023-06-27 14:48:43,437 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=1831596.0, ans=0.015 2023-06-27 14:48:54,272 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.49 vs. limit=12.0 2023-06-27 14:49:06,780 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.22 vs. limit=15.0 2023-06-27 14:49:42,989 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=1831716.0, ans=0.125 2023-06-27 14:49:43,696 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.03 vs. limit=12.0 2023-06-27 14:49:47,624 INFO [train.py:996] (3/4) Epoch 11, batch 350, loss[loss=0.1966, simple_loss=0.2648, pruned_loss=0.06423, over 21483.00 frames. ], tot_loss[loss=0.2181, simple_loss=0.2985, pruned_loss=0.06887, over 3540775.46 frames. 
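
The tot_loss brackets are not the loss of a single batch: their "over N frames" count climbs from about 9.5e5 frames at batch 50 to 3.5e6 at batch 350 above, and later in the epoch it saturates near 4.28e6. That behaviour is consistent with a frame-weighted accumulator that decays by roughly 1/200 per batch, so it effectively averages the last couple of hundred batches instead of the whole epoch. A sketch under that assumption:

# Sketch: a decayed, frame-weighted running average that reproduces the shape
# of the tot_loss columns. The 1/200-per-batch decay is inferred from the
# saturating "over ~4.28e6 frames" counts, not read out of train.py.

class RunningLoss:
    def __init__(self, decay=1.0 / 200):
        self.decay = decay
        self.frames = 0.0
        self.weighted_loss = 0.0

    def update(self, batch_frames, batch_loss):
        keep = 1.0 - self.decay
        self.frames = self.frames * keep + batch_frames
        self.weighted_loss = self.weighted_loss * keep + batch_loss * batch_frames
        return self.weighted_loss / self.frames, self.frames  # (tot_loss, "over N frames")

# Feeding ~21,400 frames per batch, self.frames reaches ~9.5e5 after 50 updates
# and ~3.5e6 after 350, matching the counts logged for epoch 11 above.
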
], batch size: 132, lr: 2.72e-03, grad_scale: 16.0 2023-06-27 14:49:51,625 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1831776.0, ans=0.125 2023-06-27 14:50:07,280 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=1831836.0, ans=0.125 2023-06-27 14:50:08,907 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=1831836.0, ans=0.125 2023-06-27 14:50:13,867 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=1831836.0, ans=0.2 2023-06-27 14:50:27,020 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1831896.0, ans=0.1 2023-06-27 14:50:31,616 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=1831896.0, ans=0.2 2023-06-27 14:50:45,443 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten.whitening_limit, batch_count=1831896.0, ans=15.0 2023-06-27 14:50:59,981 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=7.31 vs. limit=15.0 2023-06-27 14:51:19,220 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.max_abs, batch_count=1832016.0, ans=10.0 2023-06-27 14:51:30,046 INFO [train.py:996] (3/4) Epoch 11, batch 400, loss[loss=0.2502, simple_loss=0.3622, pruned_loss=0.0691, over 21620.00 frames. ], tot_loss[loss=0.2126, simple_loss=0.2933, pruned_loss=0.06601, over 3702370.37 frames. ], batch size: 441, lr: 2.72e-03, grad_scale: 32.0 2023-06-27 14:51:35,659 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1832076.0, ans=0.125 2023-06-27 14:51:37,311 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1832076.0, ans=0.125 2023-06-27 14:51:42,137 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1832076.0, ans=0.125 2023-06-27 14:51:47,924 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.78 vs. limit=10.0 2023-06-27 14:52:09,625 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.211e+02 7.477e+02 1.167e+03 1.835e+03 4.227e+03, threshold=2.334e+03, percent-clipped=25.0 2023-06-27 14:52:18,776 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1832196.0, ans=0.125 2023-06-27 14:52:33,434 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=1832256.0, ans=0.125 2023-06-27 14:52:53,653 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1832316.0, ans=0.125 2023-06-27 14:53:12,789 INFO [train.py:996] (3/4) Epoch 11, batch 450, loss[loss=0.1663, simple_loss=0.2313, pruned_loss=0.05059, over 21359.00 frames. ], tot_loss[loss=0.2088, simple_loss=0.2877, pruned_loss=0.06492, over 3826645.94 frames. 
], batch size: 131, lr: 2.72e-03, grad_scale: 16.0 2023-06-27 14:53:59,848 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.84 vs. limit=22.5 2023-06-27 14:54:56,225 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1832676.0, ans=0.125 2023-06-27 14:54:56,996 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=11.53 vs. limit=15.0 2023-06-27 14:54:57,309 INFO [train.py:996] (3/4) Epoch 11, batch 500, loss[loss=0.2002, simple_loss=0.276, pruned_loss=0.06225, over 21774.00 frames. ], tot_loss[loss=0.2078, simple_loss=0.2869, pruned_loss=0.06437, over 3927981.46 frames. ], batch size: 124, lr: 2.72e-03, grad_scale: 16.0 2023-06-27 14:55:06,344 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=1832676.0, ans=0.125 2023-06-27 14:55:21,078 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=1832736.0, ans=0.0 2023-06-27 14:55:37,223 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.096e+02 9.470e+02 1.676e+03 2.580e+03 4.364e+03, threshold=3.351e+03, percent-clipped=30.0 2023-06-27 14:56:15,991 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=11.65 vs. limit=15.0 2023-06-27 14:56:17,327 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.79 vs. limit=15.0 2023-06-27 14:56:24,702 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1832916.0, ans=0.125 2023-06-27 14:56:39,111 INFO [train.py:996] (3/4) Epoch 11, batch 550, loss[loss=0.2244, simple_loss=0.3207, pruned_loss=0.06408, over 21313.00 frames. ], tot_loss[loss=0.2082, simple_loss=0.2893, pruned_loss=0.06354, over 4006999.99 frames. ], batch size: 176, lr: 2.72e-03, grad_scale: 16.0 2023-06-27 14:56:41,283 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1832976.0, ans=0.125 2023-06-27 14:56:46,306 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1832976.0, ans=0.125 2023-06-27 14:57:04,542 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=1833036.0, ans=0.09899494936611666 2023-06-27 14:57:14,725 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.92 vs. limit=15.0 2023-06-27 14:58:07,735 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=1833216.0, ans=0.2 2023-06-27 14:58:22,141 INFO [train.py:996] (3/4) Epoch 11, batch 600, loss[loss=0.2209, simple_loss=0.3138, pruned_loss=0.06405, over 21446.00 frames. ], tot_loss[loss=0.2127, simple_loss=0.2963, pruned_loss=0.0646, over 4075036.68 frames. 
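
The ScheduledFloat lines report the current value ("ans") of hyperparameters that are scheduled on batch_count: the various skip/drop rates, balancer probabilities and min/max bounds. At this point (batch_count ~1.83e6) each such name reports the same ans from entry to entry, e.g. ff2_skip_rate is 0.0 and the balancer probs sit at 0.125, so the schedules have long since flattened out. A piecewise-linear schedule is one simple way to get this behaviour; the breakpoints below are illustrative, not the ones used for these modules.

# Sketch: a value scheduled on the global batch count. Only the interpolation
# idea is the point; the (batch_count, value) breakpoints are made up.

def scheduled_float(batch_count, points=((0.0, 0.2), (20000.0, 0.05), (50000.0, 0.0))):
    x0, y0 = points[0]
    if batch_count <= x0:
        return y0
    for x1, y1 in points[1:]:
        if batch_count <= x1:
            t = (batch_count - x0) / (x1 - x0)
            return y0 + t * (y1 - y0)   # linear interpolation between breakpoints
        x0, y0 = x1, y1
    return y0                           # past the last breakpoint: constant

print(scheduled_float(1832736.0))       # 0.0 -- the schedule flattened out long ago
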
], batch size: 211, lr: 2.71e-03, grad_scale: 16.0 2023-06-27 14:59:00,929 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.101e+02 6.551e+02 9.996e+02 1.452e+03 3.285e+03, threshold=1.999e+03, percent-clipped=0.0 2023-06-27 14:59:03,077 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1833396.0, ans=0.125 2023-06-27 14:59:13,735 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.91 vs. limit=10.0 2023-06-27 14:59:33,012 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1833456.0, ans=0.125 2023-06-27 14:59:34,649 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=1833456.0, ans=0.0 2023-06-27 15:00:01,362 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.79 vs. limit=15.0 2023-06-27 15:00:03,724 INFO [train.py:996] (3/4) Epoch 11, batch 650, loss[loss=0.2017, simple_loss=0.2769, pruned_loss=0.06327, over 21272.00 frames. ], tot_loss[loss=0.2157, simple_loss=0.3002, pruned_loss=0.06563, over 4106833.13 frames. ], batch size: 159, lr: 2.71e-03, grad_scale: 8.0 2023-06-27 15:00:41,348 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=12.85 vs. limit=22.5 2023-06-27 15:00:56,672 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=1833696.0, ans=0.0 2023-06-27 15:01:32,141 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=1833816.0, ans=0.125 2023-06-27 15:01:39,986 INFO [train.py:996] (3/4) Epoch 11, batch 700, loss[loss=0.2028, simple_loss=0.2843, pruned_loss=0.06068, over 21796.00 frames. ], tot_loss[loss=0.2167, simple_loss=0.2999, pruned_loss=0.06674, over 4151433.95 frames. ], batch size: 298, lr: 2.71e-03, grad_scale: 8.0 2023-06-27 15:01:51,714 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=1833876.0, ans=0.2 2023-06-27 15:01:59,303 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.46 vs. limit=15.0 2023-06-27 15:02:26,194 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.643e+02 7.317e+02 1.195e+03 1.924e+03 5.182e+03, threshold=2.390e+03, percent-clipped=22.0 2023-06-27 15:03:00,745 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1834056.0, ans=0.0 2023-06-27 15:03:12,211 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=1834116.0, ans=10.0 2023-06-27 15:03:26,527 INFO [train.py:996] (3/4) Epoch 11, batch 750, loss[loss=0.2086, simple_loss=0.2868, pruned_loss=0.0652, over 21476.00 frames. ], tot_loss[loss=0.2171, simple_loss=0.2979, pruned_loss=0.06815, over 4180478.13 frames. 
], batch size: 212, lr: 2.71e-03, grad_scale: 8.0 2023-06-27 15:03:50,629 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1834236.0, ans=0.1 2023-06-27 15:04:11,013 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1834296.0, ans=0.125 2023-06-27 15:04:45,934 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=1834356.0, ans=0.07 2023-06-27 15:04:57,589 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1834416.0, ans=0.125 2023-06-27 15:05:09,838 INFO [train.py:996] (3/4) Epoch 11, batch 800, loss[loss=0.2982, simple_loss=0.3882, pruned_loss=0.1042, over 21705.00 frames. ], tot_loss[loss=0.216, simple_loss=0.295, pruned_loss=0.06844, over 4209350.75 frames. ], batch size: 414, lr: 2.71e-03, grad_scale: 16.0 2023-06-27 15:05:25,848 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=1834536.0, ans=0.2 2023-06-27 15:05:51,243 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.162e+02 6.690e+02 1.036e+03 1.625e+03 3.290e+03, threshold=2.071e+03, percent-clipped=5.0 2023-06-27 15:06:53,137 INFO [train.py:996] (3/4) Epoch 11, batch 850, loss[loss=0.2115, simple_loss=0.285, pruned_loss=0.06898, over 21856.00 frames. ], tot_loss[loss=0.2166, simple_loss=0.2938, pruned_loss=0.06973, over 4227629.04 frames. ], batch size: 332, lr: 2.71e-03, grad_scale: 16.0 2023-06-27 15:07:51,700 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1834896.0, ans=0.125 2023-06-27 15:07:52,260 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=7.01 vs. limit=10.0 2023-06-27 15:08:32,830 INFO [train.py:996] (3/4) Epoch 11, batch 900, loss[loss=0.1638, simple_loss=0.2409, pruned_loss=0.0434, over 21783.00 frames. ], tot_loss[loss=0.2136, simple_loss=0.2905, pruned_loss=0.06833, over 4242535.27 frames. ], batch size: 118, lr: 2.71e-03, grad_scale: 16.0 2023-06-27 15:08:39,947 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=1835076.0, ans=10.0 2023-06-27 15:08:39,965 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1835076.0, ans=0.1 2023-06-27 15:09:18,556 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.069e+02 6.963e+02 1.051e+03 1.568e+03 3.283e+03, threshold=2.103e+03, percent-clipped=8.0 2023-06-27 15:09:26,321 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.40 vs. 
limit=22.5 2023-06-27 15:09:32,736 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1835256.0, ans=0.0 2023-06-27 15:09:40,916 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1835256.0, ans=0.1 2023-06-27 15:09:41,004 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-27 15:10:10,466 INFO [train.py:996] (3/4) Epoch 11, batch 950, loss[loss=0.2522, simple_loss=0.3055, pruned_loss=0.09943, over 21507.00 frames. ], tot_loss[loss=0.2109, simple_loss=0.2875, pruned_loss=0.06716, over 4253319.48 frames. ], batch size: 471, lr: 2.71e-03, grad_scale: 16.0 2023-06-27 15:11:02,458 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1835496.0, ans=0.125 2023-06-27 15:11:12,235 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1835496.0, ans=0.125 2023-06-27 15:11:30,448 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=1835556.0, ans=0.07 2023-06-27 15:11:35,300 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=1835616.0, ans=0.2 2023-06-27 15:11:40,511 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1835616.0, ans=0.125 2023-06-27 15:11:53,099 INFO [train.py:996] (3/4) Epoch 11, batch 1000, loss[loss=0.211, simple_loss=0.2802, pruned_loss=0.07087, over 21343.00 frames. ], tot_loss[loss=0.2111, simple_loss=0.2876, pruned_loss=0.06729, over 4261462.30 frames. ], batch size: 143, lr: 2.71e-03, grad_scale: 16.0 2023-06-27 15:11:57,226 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=1835676.0, ans=0.0 2023-06-27 15:12:41,831 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=1835796.0, ans=0.0 2023-06-27 15:12:44,388 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.633e+02 7.396e+02 1.258e+03 1.842e+03 3.420e+03, threshold=2.515e+03, percent-clipped=20.0 2023-06-27 15:12:45,654 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=16.29 vs. limit=22.5 2023-06-27 15:13:31,125 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.23 vs. limit=15.0 2023-06-27 15:13:36,703 INFO [train.py:996] (3/4) Epoch 11, batch 1050, loss[loss=0.2132, simple_loss=0.2874, pruned_loss=0.0695, over 21486.00 frames. ], tot_loss[loss=0.2103, simple_loss=0.2867, pruned_loss=0.06693, over 4273817.22 frames. 
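
The Whitening lines compare a measured statistic of a module's activations against a limit (metric=16.29 vs. limit=22.5 just above). A statistic of this kind is 1.0 when the per-group covariance of the activations is a multiple of the identity and grows as the energy concentrates in a few directions, with a correction presumably applied only once the limit is exceeded. One way to compute such a statistic is sketched below; whether scaling.py normalizes exactly this way is an assumption, only the metric-vs-limit comparison is taken from the log.

import torch

# Sketch: a whiteness statistic that is 1.0 for perfectly white activations and
# grows as the covariance becomes anisotropic. The exact normalization used by
# scaling.py is an assumption.

def whitening_metric(x, num_groups=1):
    # x: (num_frames, num_channels)
    num_frames, num_channels = x.shape
    cpg = num_channels // num_groups                        # channels per group
    x = x.reshape(num_frames, num_groups, cpg).transpose(0, 1)
    cov = torch.matmul(x.transpose(1, 2), x) / num_frames   # per-group covariance
    diag_mean = cov.diagonal(dim1=1, dim2=2).mean()
    sq_mean = (cov ** 2).sum(dim=(1, 2)).mean() / cpg
    return (sq_mean / (diag_mean ** 2 + 1e-20)).item()

x = torch.randn(2000, 192)
print(whitening_metric(x))  # just above 1.0 for i.i.d. features; logged values run ~2-16
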
], batch size: 211, lr: 2.71e-03, grad_scale: 16.0 2023-06-27 15:13:43,756 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1835976.0, ans=0.125 2023-06-27 15:13:52,415 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1835976.0, ans=0.0 2023-06-27 15:14:18,042 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=1836036.0, ans=0.0 2023-06-27 15:14:33,618 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.30 vs. limit=12.0 2023-06-27 15:15:26,356 INFO [train.py:996] (3/4) Epoch 11, batch 1100, loss[loss=0.1858, simple_loss=0.262, pruned_loss=0.05484, over 20213.00 frames. ], tot_loss[loss=0.2096, simple_loss=0.287, pruned_loss=0.06611, over 4274759.44 frames. ], batch size: 703, lr: 2.71e-03, grad_scale: 16.0 2023-06-27 15:15:31,895 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-27 15:15:35,674 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=1836276.0, ans=0.2 2023-06-27 15:15:45,997 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1836276.0, ans=0.0 2023-06-27 15:16:13,023 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.192e+02 8.562e+02 1.240e+03 1.886e+03 2.880e+03, threshold=2.480e+03, percent-clipped=5.0 2023-06-27 15:16:37,424 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=1836456.0, ans=0.125 2023-06-27 15:16:37,939 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.31 vs. limit=15.0 2023-06-27 15:16:45,875 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1836456.0, ans=0.1 2023-06-27 15:17:09,913 INFO [train.py:996] (3/4) Epoch 11, batch 1150, loss[loss=0.2114, simple_loss=0.2861, pruned_loss=0.06836, over 21259.00 frames. ], tot_loss[loss=0.2097, simple_loss=0.2882, pruned_loss=0.06559, over 4270736.16 frames. ], batch size: 176, lr: 2.71e-03, grad_scale: 16.0 2023-06-27 15:18:31,905 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=1836756.0, ans=0.2 2023-06-27 15:18:32,327 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.47 vs. limit=15.0 2023-06-27 15:18:42,560 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1836816.0, ans=0.1 2023-06-27 15:18:53,538 INFO [train.py:996] (3/4) Epoch 11, batch 1200, loss[loss=0.2135, simple_loss=0.2988, pruned_loss=0.06409, over 21807.00 frames. ], tot_loss[loss=0.211, simple_loss=0.2907, pruned_loss=0.06567, over 4271117.61 frames. 
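
grad_scale is the fp16 loss-scaling factor. In this stretch it doubles at batches 1200, 1600, 2000 and 2400 (every 400 batches) and drops back in between: 32 at batch 1200, 16 again by 1250, 8 by 1300. That is the usual dynamic-loss-scaling pattern of growing the scale periodically and halving it whenever scaled gradients overflow. A sketch with torch.cuda.amp follows; the explicit doubling every 400 batches and the cap at 32 are assumptions read off the logged values, not taken from train.py.

import torch
from torch.cuda.amp import GradScaler, autocast

scaler = GradScaler(init_scale=16.0)

def fp16_step(model, optimizer, batch, batch_idx):
    # model/batch are hypothetical stand-ins for the training loop's objects.
    optimizer.zero_grad()
    with autocast():
        loss = model(batch)
    scaler.scale(loss).backward()
    scaler.step(optimizer)     # skipped if the scaled grads contain inf/nan
    scaler.update()            # halves the scale after an overflow
    # Assumption from the log: try doubling the scale every 400 batches,
    # capped so the logged values stay within {8, 16, 32}.
    if batch_idx % 400 == 0 and scaler.get_scale() < 32.0:
        scaler.update(scaler.get_scale() * 2.0)
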
], batch size: 124, lr: 2.71e-03, grad_scale: 32.0 2023-06-27 15:19:07,500 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=1836876.0, ans=0.125 2023-06-27 15:19:47,119 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.628e+02 7.428e+02 1.142e+03 1.630e+03 3.056e+03, threshold=2.284e+03, percent-clipped=6.0 2023-06-27 15:20:04,663 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=1837056.0, ans=0.07 2023-06-27 15:20:37,521 INFO [train.py:996] (3/4) Epoch 11, batch 1250, loss[loss=0.1952, simple_loss=0.2446, pruned_loss=0.07297, over 20256.00 frames. ], tot_loss[loss=0.2124, simple_loss=0.2927, pruned_loss=0.06603, over 4268909.47 frames. ], batch size: 703, lr: 2.71e-03, grad_scale: 16.0 2023-06-27 15:21:13,315 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=1837236.0, ans=10.0 2023-06-27 15:21:32,758 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1837296.0, ans=0.125 2023-06-27 15:21:34,650 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=1837296.0, ans=0.04949747468305833 2023-06-27 15:21:56,190 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.min_positive, batch_count=1837356.0, ans=0.05 2023-06-27 15:22:06,155 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1837416.0, ans=0.125 2023-06-27 15:22:21,879 INFO [train.py:996] (3/4) Epoch 11, batch 1300, loss[loss=0.1838, simple_loss=0.2697, pruned_loss=0.04892, over 21727.00 frames. ], tot_loss[loss=0.214, simple_loss=0.2944, pruned_loss=0.06679, over 4278698.99 frames. ], batch size: 247, lr: 2.71e-03, grad_scale: 8.0 2023-06-27 15:22:47,608 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.24 vs. limit=15.0 2023-06-27 15:23:16,824 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.358e+02 6.400e+02 8.214e+02 1.269e+03 2.290e+03, threshold=1.643e+03, percent-clipped=1.0 2023-06-27 15:23:59,173 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.25 vs. limit=22.5 2023-06-27 15:24:11,935 INFO [train.py:996] (3/4) Epoch 11, batch 1350, loss[loss=0.2155, simple_loss=0.292, pruned_loss=0.06949, over 21834.00 frames. ], tot_loss[loss=0.2155, simple_loss=0.2955, pruned_loss=0.06773, over 4281971.09 frames. ], batch size: 298, lr: 2.71e-03, grad_scale: 8.0 2023-06-27 15:24:29,755 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.78 vs. limit=10.0 2023-06-27 15:24:31,424 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.22 vs. limit=12.0 2023-06-27 15:24:44,656 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=1837836.0, ans=0.0 2023-06-27 15:25:56,204 INFO [train.py:996] (3/4) Epoch 11, batch 1400, loss[loss=0.2128, simple_loss=0.2923, pruned_loss=0.06659, over 21775.00 frames. 
], tot_loss[loss=0.2161, simple_loss=0.2947, pruned_loss=0.0688, over 4281974.18 frames. ], batch size: 98, lr: 2.71e-03, grad_scale: 8.0 2023-06-27 15:26:46,608 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.209e+02 7.064e+02 1.087e+03 1.603e+03 3.118e+03, threshold=2.174e+03, percent-clipped=20.0 2023-06-27 15:26:49,914 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.34 vs. limit=6.0 2023-06-27 15:27:15,204 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1838256.0, ans=0.0 2023-06-27 15:27:39,798 INFO [train.py:996] (3/4) Epoch 11, batch 1450, loss[loss=0.2428, simple_loss=0.3394, pruned_loss=0.07313, over 21743.00 frames. ], tot_loss[loss=0.2162, simple_loss=0.2942, pruned_loss=0.06914, over 4282496.43 frames. ], batch size: 414, lr: 2.71e-03, grad_scale: 8.0 2023-06-27 15:27:55,159 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1838376.0, ans=0.125 2023-06-27 15:28:00,169 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=1838436.0, ans=0.5 2023-06-27 15:28:08,445 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1838436.0, ans=0.125 2023-06-27 15:28:12,749 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.78 vs. limit=10.0 2023-06-27 15:28:13,654 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=1838436.0, ans=0.125 2023-06-27 15:28:40,942 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-27 15:28:57,783 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1838556.0, ans=0.1 2023-06-27 15:29:02,012 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=7.35 vs. limit=15.0 2023-06-27 15:29:21,698 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1838616.0, ans=0.125 2023-06-27 15:29:28,845 INFO [train.py:996] (3/4) Epoch 11, batch 1500, loss[loss=0.2264, simple_loss=0.3047, pruned_loss=0.07404, over 21795.00 frames. ], tot_loss[loss=0.2186, simple_loss=0.2964, pruned_loss=0.07043, over 4284891.19 frames. ], batch size: 124, lr: 2.71e-03, grad_scale: 8.0 2023-06-27 15:29:31,592 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.79 vs. 
limit=15.0 2023-06-27 15:29:48,169 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1838736.0, ans=0.125 2023-06-27 15:29:51,798 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=1838736.0, ans=0.125 2023-06-27 15:30:14,622 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.898e+02 7.080e+02 9.690e+02 1.530e+03 3.266e+03, threshold=1.938e+03, percent-clipped=8.0 2023-06-27 15:30:21,996 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1838796.0, ans=0.125 2023-06-27 15:30:25,436 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1838856.0, ans=0.125 2023-06-27 15:31:08,255 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=1838916.0, ans=0.07 2023-06-27 15:31:14,246 INFO [train.py:996] (3/4) Epoch 11, batch 1550, loss[loss=0.1662, simple_loss=0.2474, pruned_loss=0.0425, over 21376.00 frames. ], tot_loss[loss=0.2152, simple_loss=0.2923, pruned_loss=0.06898, over 4288530.68 frames. ], batch size: 131, lr: 2.71e-03, grad_scale: 8.0 2023-06-27 15:31:20,462 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1838976.0, ans=0.125 2023-06-27 15:31:22,159 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=1838976.0, ans=0.125 2023-06-27 15:31:23,782 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=1838976.0, ans=0.2 2023-06-27 15:31:36,419 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=1839036.0, ans=0.125 2023-06-27 15:31:59,086 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-27 15:33:01,765 INFO [train.py:996] (3/4) Epoch 11, batch 1600, loss[loss=0.1559, simple_loss=0.2192, pruned_loss=0.04629, over 21812.00 frames. ], tot_loss[loss=0.2139, simple_loss=0.2904, pruned_loss=0.06866, over 4272807.53 frames. ], batch size: 118, lr: 2.71e-03, grad_scale: 16.0 2023-06-27 15:33:02,420 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=1839276.0, ans=0.0 2023-06-27 15:33:11,000 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=1839276.0, ans=0.125 2023-06-27 15:33:21,633 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=1839276.0, ans=0.0 2023-06-27 15:33:53,890 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.998e+02 6.555e+02 8.833e+02 1.502e+03 3.809e+03, threshold=1.767e+03, percent-clipped=10.0 2023-06-27 15:34:15,729 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1839456.0, ans=0.125 2023-06-27 15:34:48,947 INFO [train.py:996] (3/4) Epoch 11, batch 1650, loss[loss=0.2101, simple_loss=0.2702, pruned_loss=0.07497, over 21427.00 frames. ], tot_loss[loss=0.2135, simple_loss=0.2899, pruned_loss=0.06851, over 4272084.59 frames. 
], batch size: 473, lr: 2.71e-03, grad_scale: 16.0 2023-06-27 15:34:59,605 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-27 15:35:27,914 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=1839636.0, ans=0.025 2023-06-27 15:35:39,870 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1839696.0, ans=0.1 2023-06-27 15:36:05,093 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.41 vs. limit=15.0 2023-06-27 15:36:14,214 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1839756.0, ans=0.125 2023-06-27 15:36:20,340 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.56 vs. limit=15.0 2023-06-27 15:36:37,006 INFO [train.py:996] (3/4) Epoch 11, batch 1700, loss[loss=0.2145, simple_loss=0.2981, pruned_loss=0.06541, over 21579.00 frames. ], tot_loss[loss=0.2138, simple_loss=0.2918, pruned_loss=0.0679, over 4283630.94 frames. ], batch size: 230, lr: 2.71e-03, grad_scale: 16.0 2023-06-27 15:37:08,732 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=1839936.0, ans=0.05 2023-06-27 15:37:35,061 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.435e+02 5.947e+02 9.216e+02 1.351e+03 2.792e+03, threshold=1.843e+03, percent-clipped=11.0 2023-06-27 15:37:39,685 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=8.96 vs. limit=15.0 2023-06-27 15:37:56,341 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=1840056.0, ans=0.0 2023-06-27 15:38:30,373 INFO [train.py:996] (3/4) Epoch 11, batch 1750, loss[loss=0.2514, simple_loss=0.3424, pruned_loss=0.08021, over 21682.00 frames. ], tot_loss[loss=0.2139, simple_loss=0.2922, pruned_loss=0.06782, over 4282709.03 frames. ], batch size: 389, lr: 2.71e-03, grad_scale: 16.0 2023-06-27 15:38:36,649 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-27 15:39:27,236 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.44 vs. limit=6.0 2023-06-27 15:39:28,735 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.77 vs. limit=10.0 2023-06-27 15:39:32,378 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=5.87 vs. limit=15.0 2023-06-27 15:39:35,418 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1840296.0, ans=0.125 2023-06-27 15:40:14,087 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=7.03 vs. 
limit=15.0 2023-06-27 15:40:22,767 INFO [train.py:996] (3/4) Epoch 11, batch 1800, loss[loss=0.2136, simple_loss=0.3058, pruned_loss=0.06069, over 21688.00 frames. ], tot_loss[loss=0.2124, simple_loss=0.2916, pruned_loss=0.06655, over 4281113.10 frames. ], batch size: 263, lr: 2.71e-03, grad_scale: 16.0 2023-06-27 15:41:02,888 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1840536.0, ans=0.1 2023-06-27 15:41:08,209 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1840596.0, ans=0.125 2023-06-27 15:41:09,558 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=1840596.0, ans=0.125 2023-06-27 15:41:13,917 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.102e+02 6.830e+02 1.090e+03 1.802e+03 4.605e+03, threshold=2.180e+03, percent-clipped=19.0 2023-06-27 15:41:50,574 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=1840716.0, ans=0.125 2023-06-27 15:41:52,793 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=8.95 vs. limit=15.0 2023-06-27 15:42:09,051 INFO [train.py:996] (3/4) Epoch 11, batch 1850, loss[loss=0.2072, simple_loss=0.2942, pruned_loss=0.0601, over 21879.00 frames. ], tot_loss[loss=0.2092, simple_loss=0.2897, pruned_loss=0.06435, over 4273858.72 frames. ], batch size: 316, lr: 2.71e-03, grad_scale: 16.0 2023-06-27 15:42:10,462 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.74 vs. limit=12.0 2023-06-27 15:42:11,501 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1840776.0, ans=0.0 2023-06-27 15:42:23,066 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=1840776.0, ans=0.2 2023-06-27 15:42:24,672 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=1840776.0, ans=0.2 2023-06-27 15:43:27,136 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=1840956.0, ans=0.125 2023-06-27 15:43:53,670 INFO [train.py:996] (3/4) Epoch 11, batch 1900, loss[loss=0.1847, simple_loss=0.2689, pruned_loss=0.05026, over 21428.00 frames. ], tot_loss[loss=0.2095, simple_loss=0.2901, pruned_loss=0.06447, over 4281345.99 frames. ], batch size: 194, lr: 2.71e-03, grad_scale: 16.0 2023-06-27 15:44:17,919 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1841076.0, ans=0.1 2023-06-27 15:44:26,016 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1841136.0, ans=0.1 2023-06-27 15:44:33,816 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1841136.0, ans=0.125 2023-06-27 15:44:39,544 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.39 vs. 
limit=15.0 2023-06-27 15:44:43,247 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.131e+02 8.434e+02 1.477e+03 2.094e+03 4.159e+03, threshold=2.954e+03, percent-clipped=22.0 2023-06-27 15:45:34,241 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1841316.0, ans=0.1 2023-06-27 15:45:41,633 INFO [train.py:996] (3/4) Epoch 11, batch 1950, loss[loss=0.1951, simple_loss=0.2704, pruned_loss=0.05997, over 21438.00 frames. ], tot_loss[loss=0.2076, simple_loss=0.288, pruned_loss=0.06361, over 4279567.75 frames. ], batch size: 211, lr: 2.71e-03, grad_scale: 16.0 2023-06-27 15:45:54,182 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=12.57 vs. limit=15.0 2023-06-27 15:47:10,443 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1841616.0, ans=0.0 2023-06-27 15:47:19,873 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.09 vs. limit=15.0 2023-06-27 15:47:26,628 INFO [train.py:996] (3/4) Epoch 11, batch 2000, loss[loss=0.1542, simple_loss=0.2251, pruned_loss=0.04171, over 21760.00 frames. ], tot_loss[loss=0.2038, simple_loss=0.2831, pruned_loss=0.06228, over 4279884.70 frames. ], batch size: 118, lr: 2.71e-03, grad_scale: 32.0 2023-06-27 15:47:27,186 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1841676.0, ans=0.125 2023-06-27 15:47:47,380 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1841736.0, ans=0.2 2023-06-27 15:48:12,178 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1841796.0, ans=0.0 2023-06-27 15:48:13,536 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.255e+02 7.614e+02 1.079e+03 2.039e+03 3.848e+03, threshold=2.158e+03, percent-clipped=8.0 2023-06-27 15:48:41,879 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1841916.0, ans=0.0 2023-06-27 15:49:09,576 INFO [train.py:996] (3/4) Epoch 11, batch 2050, loss[loss=0.1895, simple_loss=0.2573, pruned_loss=0.06085, over 21638.00 frames. ], tot_loss[loss=0.2039, simple_loss=0.2833, pruned_loss=0.06226, over 4281844.86 frames. ], batch size: 263, lr: 2.71e-03, grad_scale: 16.0 2023-06-27 15:49:32,061 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=1842036.0, ans=0.04949747468305833 2023-06-27 15:50:20,016 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1842156.0, ans=0.125 2023-06-27 15:50:49,027 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.41 vs. limit=15.0 2023-06-27 15:50:59,226 INFO [train.py:996] (3/4) Epoch 11, batch 2100, loss[loss=0.2588, simple_loss=0.3401, pruned_loss=0.08879, over 21902.00 frames. ], tot_loss[loss=0.208, simple_loss=0.2881, pruned_loss=0.06397, over 4283436.56 frames. 
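
The lr column barely moves inside an epoch but steps down at epoch boundaries: 2.86e-03 through the end of epoch 10, 2.72e-03 when epoch 11 starts, settling at 2.71e-03 from around batch 600 onward. That is the signature of a schedule that decays with both the global batch count (already ~1.8e6 here, so nearly flat) and the epoch count. A generic sketch of such a schedule follows; the functional form and every constant in it are illustrative, not the ones used for this run.

# Sketch: a learning rate that decays in both the global batch count and the
# epoch count. All constants below are made up; only the qualitative shape --
# a visible step at the epoch boundary, near-flat behaviour within an epoch --
# matches what the log shows.

def lr_at(batch, epoch, base_lr=0.05, lr_batches=5000.0, lr_epochs=3.5):
    batch_factor = ((batch ** 2 + lr_batches ** 2) / lr_batches ** 2) ** -0.25
    epoch_factor = ((epoch ** 2 + lr_epochs ** 2) / lr_epochs ** 2) ** -0.25
    return base_lr * batch_factor * epoch_factor

# With batch ~1.8e6 the batch_factor changes by well under 1% across an epoch,
# so the printed lr only moves when the epoch term ticks over.
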
], batch size: 372, lr: 2.71e-03, grad_scale: 16.0 2023-06-27 15:51:15,156 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1842336.0, ans=0.125 2023-06-27 15:51:26,863 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1842336.0, ans=0.1 2023-06-27 15:51:38,499 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.max_abs, batch_count=1842396.0, ans=10.0 2023-06-27 15:51:46,345 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.397e+02 7.542e+02 1.130e+03 1.676e+03 4.140e+03, threshold=2.259e+03, percent-clipped=14.0 2023-06-27 15:51:54,444 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1842456.0, ans=0.125 2023-06-27 15:52:05,086 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.48 vs. limit=15.0 2023-06-27 15:52:23,255 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1842516.0, ans=0.0 2023-06-27 15:52:44,200 INFO [train.py:996] (3/4) Epoch 11, batch 2150, loss[loss=0.1982, simple_loss=0.2677, pruned_loss=0.06434, over 21866.00 frames. ], tot_loss[loss=0.2103, simple_loss=0.2893, pruned_loss=0.0657, over 4286834.12 frames. ], batch size: 373, lr: 2.71e-03, grad_scale: 16.0 2023-06-27 15:52:49,945 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1842576.0, ans=0.0 2023-06-27 15:53:16,099 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=9.81 vs. limit=22.5 2023-06-27 15:53:57,459 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.05 vs. limit=6.0 2023-06-27 15:54:12,005 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1842816.0, ans=0.125 2023-06-27 15:54:29,205 INFO [train.py:996] (3/4) Epoch 11, batch 2200, loss[loss=0.2243, simple_loss=0.3377, pruned_loss=0.05546, over 19876.00 frames. ], tot_loss[loss=0.2119, simple_loss=0.2915, pruned_loss=0.06618, over 4287469.31 frames. ], batch size: 702, lr: 2.71e-03, grad_scale: 16.0 2023-06-27 15:54:48,780 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=1842936.0, ans=0.04949747468305833 2023-06-27 15:55:02,172 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=1842996.0, ans=0.0 2023-06-27 15:55:16,678 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.261e+02 6.339e+02 9.896e+02 1.686e+03 3.946e+03, threshold=1.979e+03, percent-clipped=15.0 2023-06-27 15:55:40,739 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1843056.0, ans=0.1 2023-06-27 15:56:14,396 INFO [train.py:996] (3/4) Epoch 11, batch 2250, loss[loss=0.2133, simple_loss=0.308, pruned_loss=0.05933, over 21459.00 frames. ], tot_loss[loss=0.2107, simple_loss=0.291, pruned_loss=0.06525, over 4282053.45 frames. 
], batch size: 211, lr: 2.71e-03, grad_scale: 16.0 2023-06-27 15:56:15,126 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1843176.0, ans=0.1 2023-06-27 15:56:35,249 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=1843236.0, ans=0.125 2023-06-27 15:57:02,366 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.00 vs. limit=12.0 2023-06-27 15:57:49,665 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1843416.0, ans=0.1 2023-06-27 15:57:52,260 INFO [train.py:996] (3/4) Epoch 11, batch 2300, loss[loss=0.1906, simple_loss=0.2545, pruned_loss=0.06338, over 21830.00 frames. ], tot_loss[loss=0.2085, simple_loss=0.2879, pruned_loss=0.0645, over 4280233.99 frames. ], batch size: 98, lr: 2.71e-03, grad_scale: 16.0 2023-06-27 15:58:07,345 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=13.69 vs. limit=22.5 2023-06-27 15:58:39,377 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.760e+02 6.436e+02 1.038e+03 1.737e+03 5.031e+03, threshold=2.076e+03, percent-clipped=15.0 2023-06-27 15:58:44,884 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=1843596.0, ans=0.0 2023-06-27 15:59:36,621 INFO [train.py:996] (3/4) Epoch 11, batch 2350, loss[loss=0.2132, simple_loss=0.2795, pruned_loss=0.0735, over 21744.00 frames. ], tot_loss[loss=0.2057, simple_loss=0.2829, pruned_loss=0.0643, over 4281203.74 frames. ], batch size: 102, lr: 2.71e-03, grad_scale: 16.0 2023-06-27 15:59:47,218 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1843776.0, ans=0.1 2023-06-27 16:00:18,729 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1843896.0, ans=0.1 2023-06-27 16:00:29,126 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1843896.0, ans=0.1 2023-06-27 16:01:07,011 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-27 16:01:20,905 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-27 16:01:21,953 INFO [train.py:996] (3/4) Epoch 11, batch 2400, loss[loss=0.1931, simple_loss=0.2576, pruned_loss=0.06434, over 21527.00 frames. ], tot_loss[loss=0.2075, simple_loss=0.2832, pruned_loss=0.0659, over 4283310.95 frames. 
], batch size: 391, lr: 2.71e-03, grad_scale: 32.0 2023-06-27 16:01:26,236 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1844076.0, ans=0.1 2023-06-27 16:01:26,327 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=1844076.0, ans=0.2 2023-06-27 16:01:55,167 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1844136.0, ans=0.0 2023-06-27 16:02:12,409 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1844196.0, ans=0.125 2023-06-27 16:02:21,959 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.317e+02 6.915e+02 1.084e+03 1.714e+03 3.712e+03, threshold=2.167e+03, percent-clipped=11.0 2023-06-27 16:02:50,648 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1844316.0, ans=0.0 2023-06-27 16:03:02,812 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1844316.0, ans=0.1 2023-06-27 16:03:07,070 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.45 vs. limit=22.5 2023-06-27 16:03:07,085 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.59 vs. limit=15.0 2023-06-27 16:03:07,402 INFO [train.py:996] (3/4) Epoch 11, batch 2450, loss[loss=0.2935, simple_loss=0.3589, pruned_loss=0.114, over 21471.00 frames. ], tot_loss[loss=0.2146, simple_loss=0.29, pruned_loss=0.06966, over 4288973.20 frames. ], batch size: 471, lr: 2.71e-03, grad_scale: 16.0 2023-06-27 16:04:33,864 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=1844616.0, ans=0.2 2023-06-27 16:04:42,393 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1844616.0, ans=0.125 2023-06-27 16:04:49,999 INFO [train.py:996] (3/4) Epoch 11, batch 2500, loss[loss=0.2062, simple_loss=0.3078, pruned_loss=0.05233, over 21697.00 frames. ], tot_loss[loss=0.2142, simple_loss=0.2882, pruned_loss=0.07009, over 4261396.76 frames. ], batch size: 247, lr: 2.71e-03, grad_scale: 16.0 2023-06-27 16:04:55,306 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=1844676.0, ans=0.0 2023-06-27 16:04:55,958 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=6.47 vs. 
limit=15.0 2023-06-27 16:04:58,856 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-27 16:05:05,325 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=1844736.0, ans=0.2 2023-06-27 16:05:26,898 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1844796.0, ans=0.125 2023-06-27 16:05:43,704 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.461e+02 7.979e+02 1.093e+03 1.704e+03 3.202e+03, threshold=2.185e+03, percent-clipped=12.0 2023-06-27 16:06:14,032 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=1844856.0, ans=0.2 2023-06-27 16:06:34,033 INFO [train.py:996] (3/4) Epoch 11, batch 2550, loss[loss=0.2068, simple_loss=0.2819, pruned_loss=0.06581, over 21502.00 frames. ], tot_loss[loss=0.2124, simple_loss=0.2876, pruned_loss=0.06855, over 4259488.46 frames. ], batch size: 389, lr: 2.71e-03, grad_scale: 16.0 2023-06-27 16:07:03,307 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1845036.0, ans=0.1 2023-06-27 16:07:58,595 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=1845156.0, ans=0.2 2023-06-27 16:08:18,037 INFO [train.py:996] (3/4) Epoch 11, batch 2600, loss[loss=0.2048, simple_loss=0.3065, pruned_loss=0.05158, over 21400.00 frames. ], tot_loss[loss=0.2134, simple_loss=0.2893, pruned_loss=0.06872, over 4261725.96 frames. ], batch size: 211, lr: 2.71e-03, grad_scale: 16.0 2023-06-27 16:08:19,179 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.37 vs. limit=6.0 2023-06-27 16:08:29,718 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=12.31 vs. limit=22.5 2023-06-27 16:08:50,590 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1845336.0, ans=0.1 2023-06-27 16:09:12,267 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.122e+02 7.338e+02 1.284e+03 1.915e+03 4.312e+03, threshold=2.567e+03, percent-clipped=18.0 2023-06-27 16:09:44,971 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1845516.0, ans=0.1 2023-06-27 16:09:58,131 INFO [train.py:996] (3/4) Epoch 11, batch 2650, loss[loss=0.1779, simple_loss=0.2355, pruned_loss=0.06014, over 21173.00 frames. ], tot_loss[loss=0.2157, simple_loss=0.2916, pruned_loss=0.06995, over 4259085.98 frames. ], batch size: 548, lr: 2.71e-03, grad_scale: 16.0 2023-06-27 16:10:00,594 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer_na.min_abs, batch_count=1845576.0, ans=0.02 2023-06-27 16:11:33,693 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1845816.0, ans=0.125 2023-06-27 16:11:35,975 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.05 vs. 
limit=6.0 2023-06-27 16:11:41,378 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.72 vs. limit=10.0 2023-06-27 16:11:43,798 INFO [train.py:996] (3/4) Epoch 11, batch 2700, loss[loss=0.2024, simple_loss=0.2893, pruned_loss=0.0577, over 21787.00 frames. ], tot_loss[loss=0.2143, simple_loss=0.2906, pruned_loss=0.06905, over 4257623.91 frames. ], batch size: 282, lr: 2.71e-03, grad_scale: 16.0 2023-06-27 16:12:43,563 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.729e+02 6.625e+02 9.246e+02 1.409e+03 2.648e+03, threshold=1.849e+03, percent-clipped=2.0 2023-06-27 16:13:28,865 INFO [train.py:996] (3/4) Epoch 11, batch 2750, loss[loss=0.2076, simple_loss=0.2889, pruned_loss=0.06318, over 21503.00 frames. ], tot_loss[loss=0.2141, simple_loss=0.2897, pruned_loss=0.06922, over 4269009.29 frames. ], batch size: 131, lr: 2.71e-03, grad_scale: 16.0 2023-06-27 16:14:53,788 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=1846356.0, ans=0.125 2023-06-27 16:15:15,718 INFO [train.py:996] (3/4) Epoch 11, batch 2800, loss[loss=0.2451, simple_loss=0.3351, pruned_loss=0.07758, over 21818.00 frames. ], tot_loss[loss=0.217, simple_loss=0.2943, pruned_loss=0.06986, over 4274285.57 frames. ], batch size: 316, lr: 2.71e-03, grad_scale: 32.0 2023-06-27 16:15:26,829 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=1846476.0, ans=0.0 2023-06-27 16:16:18,332 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.904e+02 7.981e+02 1.210e+03 1.745e+03 3.756e+03, threshold=2.419e+03, percent-clipped=24.0 2023-06-27 16:16:29,696 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=1846656.0, ans=0.2 2023-06-27 16:17:03,351 INFO [train.py:996] (3/4) Epoch 11, batch 2850, loss[loss=0.1539, simple_loss=0.2126, pruned_loss=0.04761, over 21146.00 frames. ], tot_loss[loss=0.218, simple_loss=0.2949, pruned_loss=0.07057, over 4274494.56 frames. 
], batch size: 143, lr: 2.71e-03, grad_scale: 16.0 2023-06-27 16:17:10,632 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=1846776.0, ans=0.0 2023-06-27 16:17:53,773 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1846896.0, ans=0.125 2023-06-27 16:17:53,869 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1846896.0, ans=0.1 2023-06-27 16:17:55,350 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=1846896.0, ans=0.0 2023-06-27 16:17:55,448 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1846896.0, ans=0.125 2023-06-27 16:17:57,347 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=1846896.0, ans=0.0 2023-06-27 16:18:27,043 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1847016.0, ans=0.125 2023-06-27 16:18:28,835 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1847016.0, ans=0.125 2023-06-27 16:18:41,445 INFO [train.py:996] (3/4) Epoch 11, batch 2900, loss[loss=0.2104, simple_loss=0.2781, pruned_loss=0.07137, over 21680.00 frames. ], tot_loss[loss=0.2176, simple_loss=0.2934, pruned_loss=0.07092, over 4272402.67 frames. ], batch size: 230, lr: 2.70e-03, grad_scale: 16.0 2023-06-27 16:18:52,974 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1847076.0, ans=0.0 2023-06-27 16:19:09,439 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=1847136.0, ans=0.0 2023-06-27 16:19:42,834 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1847196.0, ans=0.1 2023-06-27 16:19:43,471 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=14.69 vs. limit=15.0 2023-06-27 16:19:45,511 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.568e+02 6.840e+02 9.553e+02 1.645e+03 3.808e+03, threshold=1.911e+03, percent-clipped=8.0 2023-06-27 16:20:05,929 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.27 vs. limit=15.0 2023-06-27 16:20:10,253 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1847316.0, ans=0.125 2023-06-27 16:20:17,223 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=1847316.0, ans=0.0 2023-06-27 16:20:25,239 INFO [train.py:996] (3/4) Epoch 11, batch 2950, loss[loss=0.2016, simple_loss=0.295, pruned_loss=0.05413, over 21457.00 frames. ], tot_loss[loss=0.2165, simple_loss=0.2928, pruned_loss=0.07009, over 4278960.93 frames. 
], batch size: 194, lr: 2.70e-03, grad_scale: 16.0 2023-06-27 16:20:25,881 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=1847376.0, ans=0.0 2023-06-27 16:22:14,916 INFO [train.py:996] (3/4) Epoch 11, batch 3000, loss[loss=0.2569, simple_loss=0.3393, pruned_loss=0.08722, over 21424.00 frames. ], tot_loss[loss=0.2176, simple_loss=0.2966, pruned_loss=0.06936, over 4273069.88 frames. ], batch size: 131, lr: 2.70e-03, grad_scale: 16.0 2023-06-27 16:22:14,917 INFO [train.py:1019] (3/4) Computing validation loss 2023-06-27 16:22:35,497 INFO [train.py:1028] (3/4) Epoch 11, validation: loss=0.2528, simple_loss=0.3433, pruned_loss=0.08109, over 1796401.00 frames. 2023-06-27 16:22:35,498 INFO [train.py:1029] (3/4) Maximum memory allocated so far is 23690MB 2023-06-27 16:22:47,488 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=11.17 vs. limit=15.0 2023-06-27 16:23:05,687 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1847736.0, ans=0.125 2023-06-27 16:23:26,486 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=1847796.0, ans=0.0 2023-06-27 16:23:27,444 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.016e+02 6.559e+02 9.881e+02 1.581e+03 3.511e+03, threshold=1.976e+03, percent-clipped=15.0 2023-06-27 16:23:29,742 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1847796.0, ans=0.0 2023-06-27 16:23:31,238 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=1847856.0, ans=0.2 2023-06-27 16:23:41,417 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=1847856.0, ans=0.0 2023-06-27 16:23:48,415 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=1847916.0, ans=0.125 2023-06-27 16:24:16,758 INFO [train.py:996] (3/4) Epoch 11, batch 3050, loss[loss=0.1911, simple_loss=0.2942, pruned_loss=0.04403, over 21723.00 frames. ], tot_loss[loss=0.2172, simple_loss=0.2973, pruned_loss=0.06855, over 4275806.74 frames. ], batch size: 298, lr: 2.70e-03, grad_scale: 16.0 2023-06-27 16:24:37,672 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=1847976.0, ans=0.04949747468305833 2023-06-27 16:24:53,223 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1848036.0, ans=0.1 2023-06-27 16:25:19,026 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-27 16:25:22,556 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1848156.0, ans=0.125 2023-06-27 16:25:39,636 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.39 vs. 
limit=12.0 2023-06-27 16:26:00,888 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=1848216.0, ans=0.0 2023-06-27 16:26:03,793 INFO [train.py:996] (3/4) Epoch 11, batch 3100, loss[loss=0.1781, simple_loss=0.2654, pruned_loss=0.04542, over 21518.00 frames. ], tot_loss[loss=0.2159, simple_loss=0.2965, pruned_loss=0.06769, over 4279922.39 frames. ], batch size: 195, lr: 2.70e-03, grad_scale: 16.0 2023-06-27 16:26:54,553 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.218e+02 9.868e+02 1.604e+03 2.316e+03 3.970e+03, threshold=3.207e+03, percent-clipped=39.0 2023-06-27 16:27:32,450 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1848516.0, ans=0.125 2023-06-27 16:27:54,309 INFO [train.py:996] (3/4) Epoch 11, batch 3150, loss[loss=0.247, simple_loss=0.3345, pruned_loss=0.07973, over 21630.00 frames. ], tot_loss[loss=0.2157, simple_loss=0.2968, pruned_loss=0.06728, over 4281395.34 frames. ], batch size: 414, lr: 2.70e-03, grad_scale: 16.0 2023-06-27 16:28:02,367 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.91 vs. limit=15.0 2023-06-27 16:28:14,153 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=1848636.0, ans=0.125 2023-06-27 16:29:05,002 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1848756.0, ans=0.125 2023-06-27 16:29:40,789 INFO [train.py:996] (3/4) Epoch 11, batch 3200, loss[loss=0.1999, simple_loss=0.2881, pruned_loss=0.05582, over 21804.00 frames. ], tot_loss[loss=0.2173, simple_loss=0.2988, pruned_loss=0.06789, over 4280590.71 frames. ], batch size: 282, lr: 2.70e-03, grad_scale: 32.0 2023-06-27 16:30:38,678 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=1848996.0, ans=0.2 2023-06-27 16:30:42,997 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.812e+02 8.312e+02 1.188e+03 1.817e+03 3.495e+03, threshold=2.376e+03, percent-clipped=3.0 2023-06-27 16:31:14,126 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1849116.0, ans=0.0 2023-06-27 16:31:18,986 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=1849116.0, ans=0.1 2023-06-27 16:31:25,350 INFO [train.py:996] (3/4) Epoch 11, batch 3250, loss[loss=0.224, simple_loss=0.3051, pruned_loss=0.0715, over 16117.00 frames. ], tot_loss[loss=0.221, simple_loss=0.3013, pruned_loss=0.07036, over 4272674.91 frames. ], batch size: 60, lr: 2.70e-03, grad_scale: 16.0 2023-06-27 16:31:46,559 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1849236.0, ans=0.1 2023-06-27 16:32:14,410 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=12.34 vs. 
limit=15.0 2023-06-27 16:32:39,032 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.min_positive, batch_count=1849356.0, ans=0.05 2023-06-27 16:33:06,897 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1849416.0, ans=0.125 2023-06-27 16:33:11,155 INFO [train.py:996] (3/4) Epoch 11, batch 3300, loss[loss=0.2545, simple_loss=0.3402, pruned_loss=0.08443, over 21415.00 frames. ], tot_loss[loss=0.2202, simple_loss=0.2997, pruned_loss=0.07034, over 4266539.31 frames. ], batch size: 471, lr: 2.70e-03, grad_scale: 8.0 2023-06-27 16:33:23,907 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1849476.0, ans=0.125 2023-06-27 16:33:44,738 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.28 vs. limit=15.0 2023-06-27 16:33:45,786 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1849536.0, ans=0.0 2023-06-27 16:34:15,386 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.316e+02 6.747e+02 1.095e+03 2.044e+03 4.676e+03, threshold=2.190e+03, percent-clipped=15.0 2023-06-27 16:34:17,702 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1849656.0, ans=0.0 2023-06-27 16:34:32,767 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1849716.0, ans=0.0 2023-06-27 16:34:50,721 INFO [train.py:996] (3/4) Epoch 11, batch 3350, loss[loss=0.2432, simple_loss=0.3067, pruned_loss=0.08986, over 21641.00 frames. ], tot_loss[loss=0.2209, simple_loss=0.3013, pruned_loss=0.07021, over 4271463.35 frames. ], batch size: 441, lr: 2.70e-03, grad_scale: 8.0 2023-06-27 16:35:21,650 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=11.42 vs. limit=15.0 2023-06-27 16:36:28,075 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=1850016.0, ans=0.0 2023-06-27 16:36:35,660 INFO [train.py:996] (3/4) Epoch 11, batch 3400, loss[loss=0.2107, simple_loss=0.2927, pruned_loss=0.06432, over 21534.00 frames. ], tot_loss[loss=0.222, simple_loss=0.3016, pruned_loss=0.07125, over 4279975.35 frames. ], batch size: 389, lr: 2.70e-03, grad_scale: 8.0 2023-06-27 16:37:43,317 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.448e+02 6.438e+02 9.614e+02 1.434e+03 2.571e+03, threshold=1.923e+03, percent-clipped=1.0 2023-06-27 16:38:24,830 INFO [train.py:996] (3/4) Epoch 11, batch 3450, loss[loss=0.209, simple_loss=0.2716, pruned_loss=0.07317, over 21598.00 frames. ], tot_loss[loss=0.2178, simple_loss=0.2959, pruned_loss=0.0699, over 4275032.05 frames. ], batch size: 393, lr: 2.70e-03, grad_scale: 8.0 2023-06-27 16:39:04,639 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=1850436.0, ans=0.2 2023-06-27 16:40:15,589 INFO [train.py:996] (3/4) Epoch 11, batch 3500, loss[loss=0.2014, simple_loss=0.2675, pruned_loss=0.06762, over 21497.00 frames. ], tot_loss[loss=0.2232, simple_loss=0.3015, pruned_loss=0.07246, over 4269917.51 frames. 
], batch size: 441, lr: 2.70e-03, grad_scale: 8.0 2023-06-27 16:40:55,162 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=6.05 vs. limit=15.0 2023-06-27 16:40:59,314 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=1850796.0, ans=0.5 2023-06-27 16:41:14,140 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.683e+02 8.214e+02 1.340e+03 2.218e+03 5.014e+03, threshold=2.681e+03, percent-clipped=29.0 2023-06-27 16:41:54,950 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=1850916.0, ans=0.2 2023-06-27 16:42:05,012 INFO [train.py:996] (3/4) Epoch 11, batch 3550, loss[loss=0.2431, simple_loss=0.307, pruned_loss=0.08956, over 21327.00 frames. ], tot_loss[loss=0.2257, simple_loss=0.3037, pruned_loss=0.07386, over 4273567.53 frames. ], batch size: 471, lr: 2.70e-03, grad_scale: 8.0 2023-06-27 16:42:10,870 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1850976.0, ans=0.125 2023-06-27 16:42:11,407 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.67 vs. limit=15.0 2023-06-27 16:42:13,081 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.44 vs. limit=15.0 2023-06-27 16:42:44,772 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.39 vs. limit=15.0 2023-06-27 16:43:39,555 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1851216.0, ans=0.1 2023-06-27 16:43:49,621 INFO [train.py:996] (3/4) Epoch 11, batch 3600, loss[loss=0.2157, simple_loss=0.2834, pruned_loss=0.07395, over 21788.00 frames. ], tot_loss[loss=0.2235, simple_loss=0.2992, pruned_loss=0.07384, over 4272481.71 frames. ], batch size: 98, lr: 2.70e-03, grad_scale: 16.0 2023-06-27 16:44:08,997 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1851276.0, ans=0.0 2023-06-27 16:44:31,757 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=1851396.0, ans=0.0 2023-06-27 16:44:44,960 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.210e+02 6.431e+02 1.048e+03 1.688e+03 3.904e+03, threshold=2.095e+03, percent-clipped=4.0 2023-06-27 16:44:47,276 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1851456.0, ans=0.1 2023-06-27 16:45:36,112 INFO [train.py:996] (3/4) Epoch 11, batch 3650, loss[loss=0.2144, simple_loss=0.2891, pruned_loss=0.06988, over 21609.00 frames. ], tot_loss[loss=0.2229, simple_loss=0.2989, pruned_loss=0.07348, over 4277904.24 frames. ], batch size: 263, lr: 2.70e-03, grad_scale: 16.0 2023-06-27 16:45:39,998 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1851576.0, ans=0.1 2023-06-27 16:46:32,346 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=4.49 vs. 
limit=15.0 2023-06-27 16:47:03,169 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1851816.0, ans=0.125 2023-06-27 16:47:19,915 INFO [train.py:996] (3/4) Epoch 11, batch 3700, loss[loss=0.2238, simple_loss=0.3012, pruned_loss=0.07323, over 21359.00 frames. ], tot_loss[loss=0.2199, simple_loss=0.2962, pruned_loss=0.07179, over 4278346.32 frames. ], batch size: 159, lr: 2.70e-03, grad_scale: 16.0 2023-06-27 16:47:49,526 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.16 vs. limit=10.0 2023-06-27 16:47:51,021 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.89 vs. limit=12.0 2023-06-27 16:48:02,610 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1851996.0, ans=0.0 2023-06-27 16:48:13,913 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.881e+02 6.744e+02 1.016e+03 1.702e+03 3.129e+03, threshold=2.032e+03, percent-clipped=14.0 2023-06-27 16:48:44,075 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer_ff3.min_abs, batch_count=1852116.0, ans=0.2 2023-06-27 16:48:52,222 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten.whitening_limit, batch_count=1852116.0, ans=15.0 2023-06-27 16:49:04,967 INFO [train.py:996] (3/4) Epoch 11, batch 3750, loss[loss=0.176, simple_loss=0.2452, pruned_loss=0.05337, over 21249.00 frames. ], tot_loss[loss=0.2204, simple_loss=0.2963, pruned_loss=0.07229, over 4282531.08 frames. ], batch size: 159, lr: 2.70e-03, grad_scale: 16.0 2023-06-27 16:49:09,636 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten.whitening_limit, batch_count=1852176.0, ans=15.0 2023-06-27 16:49:39,406 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1852236.0, ans=0.125 2023-06-27 16:50:20,370 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1852356.0, ans=0.125 2023-06-27 16:50:27,120 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=1852416.0, ans=0.2 2023-06-27 16:50:30,425 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1852416.0, ans=0.125 2023-06-27 16:50:49,290 INFO [train.py:996] (3/4) Epoch 11, batch 3800, loss[loss=0.1991, simple_loss=0.2752, pruned_loss=0.06152, over 21693.00 frames. ], tot_loss[loss=0.2177, simple_loss=0.2939, pruned_loss=0.07075, over 4283801.69 frames. 
], batch size: 112, lr: 2.70e-03, grad_scale: 16.0 2023-06-27 16:50:49,999 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1852476.0, ans=0.125 2023-06-27 16:50:51,853 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1852476.0, ans=0.125 2023-06-27 16:50:53,662 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=1852476.0, ans=0.0 2023-06-27 16:51:13,261 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=1852536.0, ans=0.125 2023-06-27 16:51:22,376 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.97 vs. limit=15.0 2023-06-27 16:51:47,703 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.392e+02 7.087e+02 9.540e+02 1.301e+03 2.936e+03, threshold=1.908e+03, percent-clipped=6.0 2023-06-27 16:51:56,160 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1852656.0, ans=0.0 2023-06-27 16:51:56,194 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1852656.0, ans=0.0 2023-06-27 16:51:56,310 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1852656.0, ans=0.125 2023-06-27 16:52:14,288 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.94 vs. limit=15.0 2023-06-27 16:52:25,356 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=5.52 vs. limit=15.0 2023-06-27 16:52:32,367 INFO [train.py:996] (3/4) Epoch 11, batch 3850, loss[loss=0.1865, simple_loss=0.2538, pruned_loss=0.05966, over 21657.00 frames. ], tot_loss[loss=0.2159, simple_loss=0.2908, pruned_loss=0.07046, over 4273856.51 frames. ], batch size: 282, lr: 2.70e-03, grad_scale: 16.0 2023-06-27 16:53:14,856 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=1852896.0, ans=0.2 2023-06-27 16:53:19,500 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=1852896.0, ans=0.025 2023-06-27 16:54:00,430 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1853016.0, ans=0.125 2023-06-27 16:54:13,614 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1853076.0, ans=0.0 2023-06-27 16:54:14,663 INFO [train.py:996] (3/4) Epoch 11, batch 3900, loss[loss=0.1995, simple_loss=0.2684, pruned_loss=0.06526, over 21842.00 frames. ], tot_loss[loss=0.2135, simple_loss=0.2863, pruned_loss=0.07032, over 4271560.27 frames. 
], batch size: 282, lr: 2.70e-03, grad_scale: 16.0 2023-06-27 16:54:21,901 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=1853076.0, ans=0.125 2023-06-27 16:54:45,760 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1853136.0, ans=0.0 2023-06-27 16:54:56,854 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.70 vs. limit=15.0 2023-06-27 16:55:09,164 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.347e+02 6.134e+02 8.883e+02 1.369e+03 3.236e+03, threshold=1.777e+03, percent-clipped=7.0 2023-06-27 16:55:21,536 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1853256.0, ans=0.0 2023-06-27 16:55:54,568 INFO [train.py:996] (3/4) Epoch 11, batch 3950, loss[loss=0.1695, simple_loss=0.2604, pruned_loss=0.03927, over 21406.00 frames. ], tot_loss[loss=0.2128, simple_loss=0.2873, pruned_loss=0.06916, over 4277391.21 frames. ], batch size: 211, lr: 2.70e-03, grad_scale: 16.0 2023-06-27 16:56:05,493 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1853376.0, ans=0.0 2023-06-27 16:56:19,993 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1853436.0, ans=0.125 2023-06-27 16:56:52,068 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=1853496.0, ans=0.05 2023-06-27 16:56:54,114 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.44 vs. limit=15.0 2023-06-27 16:57:05,397 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=1853556.0, ans=0.2 2023-06-27 16:57:32,924 INFO [train.py:996] (3/4) Epoch 11, batch 4000, loss[loss=0.2133, simple_loss=0.2899, pruned_loss=0.06831, over 21948.00 frames. ], tot_loss[loss=0.2111, simple_loss=0.2869, pruned_loss=0.06761, over 4274781.60 frames. ], batch size: 103, lr: 2.70e-03, grad_scale: 32.0 2023-06-27 16:57:51,843 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.24 vs. limit=15.0 2023-06-27 16:58:31,942 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.99 vs. limit=15.0 2023-06-27 16:58:37,194 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.700e+02 6.762e+02 1.217e+03 2.027e+03 5.671e+03, threshold=2.434e+03, percent-clipped=30.0 2023-06-27 16:59:06,113 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1853916.0, ans=0.125 2023-06-27 16:59:09,438 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=1853916.0, ans=0.125 2023-06-27 16:59:17,769 INFO [train.py:996] (3/4) Epoch 11, batch 4050, loss[loss=0.2467, simple_loss=0.3252, pruned_loss=0.08413, over 21409.00 frames. ], tot_loss[loss=0.2088, simple_loss=0.2858, pruned_loss=0.06584, over 4267020.47 frames. 
], batch size: 471, lr: 2.70e-03, grad_scale: 32.0 2023-06-27 16:59:28,989 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten.whitening_limit, batch_count=1853976.0, ans=22.5 2023-06-27 16:59:46,651 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=1854036.0, ans=0.2 2023-06-27 17:01:01,309 INFO [train.py:996] (3/4) Epoch 11, batch 4100, loss[loss=0.1945, simple_loss=0.271, pruned_loss=0.05895, over 21905.00 frames. ], tot_loss[loss=0.2095, simple_loss=0.2862, pruned_loss=0.06638, over 4270371.71 frames. ], batch size: 316, lr: 2.70e-03, grad_scale: 16.0 2023-06-27 17:01:34,355 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=1854336.0, ans=0.2 2023-06-27 17:02:00,300 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-27 17:02:11,185 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.335e+02 7.301e+02 1.093e+03 1.524e+03 3.311e+03, threshold=2.186e+03, percent-clipped=4.0 2023-06-27 17:02:13,612 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=1854456.0, ans=0.2 2023-06-27 17:02:45,149 INFO [train.py:996] (3/4) Epoch 11, batch 4150, loss[loss=0.168, simple_loss=0.2632, pruned_loss=0.03645, over 21677.00 frames. ], tot_loss[loss=0.2071, simple_loss=0.2864, pruned_loss=0.06386, over 4273544.64 frames. ], batch size: 247, lr: 2.70e-03, grad_scale: 16.0 2023-06-27 17:02:51,179 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=1854576.0, ans=0.0 2023-06-27 17:03:04,197 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=1854576.0, ans=0.125 2023-06-27 17:03:28,671 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1854696.0, ans=0.0 2023-06-27 17:03:59,903 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=1854756.0, ans=0.04949747468305833 2023-06-27 17:04:17,805 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1854816.0, ans=0.125 2023-06-27 17:04:27,386 INFO [train.py:996] (3/4) Epoch 11, batch 4200, loss[loss=0.1939, simple_loss=0.2838, pruned_loss=0.05195, over 19914.00 frames. ], tot_loss[loss=0.2063, simple_loss=0.2867, pruned_loss=0.06296, over 4275143.22 frames. ], batch size: 703, lr: 2.70e-03, grad_scale: 16.0 2023-06-27 17:04:35,454 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1854876.0, ans=0.0 2023-06-27 17:04:56,520 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=12.01 vs. 
limit=15.0 2023-06-27 17:05:12,632 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1854936.0, ans=0.0 2023-06-27 17:05:21,269 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=1854996.0, ans=0.2 2023-06-27 17:05:34,650 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.415e+02 6.010e+02 8.416e+02 1.376e+03 4.083e+03, threshold=1.683e+03, percent-clipped=10.0 2023-06-27 17:05:43,741 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=1855056.0, ans=0.025 2023-06-27 17:06:09,493 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1855116.0, ans=0.1 2023-06-27 17:06:14,235 INFO [train.py:996] (3/4) Epoch 11, batch 4250, loss[loss=0.2781, simple_loss=0.345, pruned_loss=0.1056, over 21781.00 frames. ], tot_loss[loss=0.2105, simple_loss=0.2914, pruned_loss=0.06478, over 4276479.28 frames. ], batch size: 441, lr: 2.70e-03, grad_scale: 16.0 2023-06-27 17:06:35,293 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=7.87 vs. limit=15.0 2023-06-27 17:06:42,544 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=12.34 vs. limit=15.0 2023-06-27 17:07:16,783 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1855356.0, ans=0.125 2023-06-27 17:07:23,828 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1855356.0, ans=0.125 2023-06-27 17:07:29,348 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.67 vs. limit=15.0 2023-06-27 17:08:00,642 INFO [train.py:996] (3/4) Epoch 11, batch 4300, loss[loss=0.2207, simple_loss=0.2863, pruned_loss=0.07754, over 21535.00 frames. ], tot_loss[loss=0.2157, simple_loss=0.2973, pruned_loss=0.06701, over 4276470.63 frames. ], batch size: 548, lr: 2.70e-03, grad_scale: 16.0 2023-06-27 17:08:08,021 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=1855476.0, ans=0.0 2023-06-27 17:08:11,771 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1855476.0, ans=0.1 2023-06-27 17:08:20,781 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=12.63 vs. 
limit=15.0 2023-06-27 17:08:37,958 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=1855596.0, ans=0.125 2023-06-27 17:08:51,171 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1855596.0, ans=0.125 2023-06-27 17:08:55,702 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.494e+02 7.202e+02 1.029e+03 1.570e+03 4.728e+03, threshold=2.058e+03, percent-clipped=18.0 2023-06-27 17:09:36,562 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=1855716.0, ans=0.2 2023-06-27 17:09:39,120 INFO [train.py:996] (3/4) Epoch 11, batch 4350, loss[loss=0.1963, simple_loss=0.2718, pruned_loss=0.0604, over 21768.00 frames. ], tot_loss[loss=0.2156, simple_loss=0.2974, pruned_loss=0.06689, over 4276363.27 frames. ], batch size: 371, lr: 2.70e-03, grad_scale: 16.0 2023-06-27 17:09:53,691 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.75 vs. limit=12.0 2023-06-27 17:09:56,475 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1855776.0, ans=0.1 2023-06-27 17:10:15,339 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1855836.0, ans=0.125 2023-06-27 17:10:22,294 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1855896.0, ans=0.125 2023-06-27 17:10:52,608 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.min_positive, batch_count=1855956.0, ans=0.05 2023-06-27 17:10:54,823 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1855956.0, ans=0.125 2023-06-27 17:10:56,423 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1855956.0, ans=0.0 2023-06-27 17:11:21,953 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=1856016.0, ans=0.2 2023-06-27 17:11:29,245 INFO [train.py:996] (3/4) Epoch 11, batch 4400, loss[loss=0.1942, simple_loss=0.2795, pruned_loss=0.0545, over 21356.00 frames. ], tot_loss[loss=0.2122, simple_loss=0.2932, pruned_loss=0.06564, over 4272238.98 frames. ], batch size: 160, lr: 2.70e-03, grad_scale: 32.0 2023-06-27 17:11:38,850 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1856076.0, ans=0.125 2023-06-27 17:12:07,395 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.06 vs. limit=10.0 2023-06-27 17:12:32,722 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.456e+02 7.940e+02 1.162e+03 1.682e+03 5.044e+03, threshold=2.325e+03, percent-clipped=15.0 2023-06-27 17:13:05,299 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=1856316.0, ans=0.125 2023-06-27 17:13:14,918 INFO [train.py:996] (3/4) Epoch 11, batch 4450, loss[loss=0.235, simple_loss=0.3348, pruned_loss=0.06759, over 21832.00 frames. ], tot_loss[loss=0.2176, simple_loss=0.3007, pruned_loss=0.06725, over 4277023.58 frames. 
], batch size: 316, lr: 2.70e-03, grad_scale: 16.0 2023-06-27 17:13:17,019 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=1856376.0, ans=0.125 2023-06-27 17:13:30,716 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1856436.0, ans=0.1 2023-06-27 17:13:30,856 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=1856436.0, ans=0.2 2023-06-27 17:14:58,767 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=1856676.0, ans=0.2 2023-06-27 17:14:59,762 INFO [train.py:996] (3/4) Epoch 11, batch 4500, loss[loss=0.2204, simple_loss=0.3154, pruned_loss=0.06274, over 21609.00 frames. ], tot_loss[loss=0.2203, simple_loss=0.3028, pruned_loss=0.06893, over 4285018.20 frames. ], batch size: 230, lr: 2.70e-03, grad_scale: 16.0 2023-06-27 17:15:08,765 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1856676.0, ans=0.125 2023-06-27 17:15:09,319 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.77 vs. limit=10.0 2023-06-27 17:15:42,162 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1856796.0, ans=0.125 2023-06-27 17:16:01,162 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.839e+02 8.296e+02 1.426e+03 1.842e+03 5.527e+03, threshold=2.851e+03, percent-clipped=18.0 2023-06-27 17:16:38,269 INFO [train.py:996] (3/4) Epoch 11, batch 4550, loss[loss=0.2714, simple_loss=0.3802, pruned_loss=0.08128, over 21234.00 frames. ], tot_loss[loss=0.2218, simple_loss=0.3051, pruned_loss=0.0692, over 4287534.75 frames. ], batch size: 549, lr: 2.70e-03, grad_scale: 16.0 2023-06-27 17:16:50,472 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=1856976.0, ans=0.015 2023-06-27 17:17:33,264 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=1857096.0, ans=0.1 2023-06-27 17:17:33,924 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=8.30 vs. limit=15.0 2023-06-27 17:18:21,927 INFO [train.py:996] (3/4) Epoch 11, batch 4600, loss[loss=0.2399, simple_loss=0.3468, pruned_loss=0.0665, over 17094.00 frames. ], tot_loss[loss=0.2238, simple_loss=0.3066, pruned_loss=0.07051, over 4286711.58 frames. 
], batch size: 61, lr: 2.70e-03, grad_scale: 16.0 2023-06-27 17:18:32,463 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=1857276.0, ans=0.125 2023-06-27 17:18:37,717 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=1857336.0, ans=0.2 2023-06-27 17:19:24,613 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=1857396.0, ans=0.125 2023-06-27 17:19:33,355 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.579e+02 7.640e+02 1.105e+03 1.523e+03 3.294e+03, threshold=2.209e+03, percent-clipped=1.0 2023-06-27 17:19:35,585 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1857456.0, ans=0.0 2023-06-27 17:20:05,571 INFO [train.py:996] (3/4) Epoch 11, batch 4650, loss[loss=0.1606, simple_loss=0.2434, pruned_loss=0.03884, over 21744.00 frames. ], tot_loss[loss=0.219, simple_loss=0.3009, pruned_loss=0.06858, over 4285763.09 frames. ], batch size: 298, lr: 2.70e-03, grad_scale: 16.0 2023-06-27 17:20:24,033 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1857576.0, ans=0.125 2023-06-27 17:20:37,467 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=1857636.0, ans=0.0 2023-06-27 17:20:37,607 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1857636.0, ans=0.125 2023-06-27 17:21:49,624 INFO [train.py:996] (3/4) Epoch 11, batch 4700, loss[loss=0.1648, simple_loss=0.2362, pruned_loss=0.04668, over 21591.00 frames. ], tot_loss[loss=0.2129, simple_loss=0.2933, pruned_loss=0.06627, over 4277030.29 frames. ], batch size: 263, lr: 2.70e-03, grad_scale: 16.0 2023-06-27 17:22:24,517 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1857936.0, ans=0.125 2023-06-27 17:22:25,117 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.67 vs. limit=15.0 2023-06-27 17:22:59,475 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.63 vs. limit=22.5 2023-06-27 17:22:59,956 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.795e+02 6.918e+02 1.097e+03 1.707e+03 4.002e+03, threshold=2.193e+03, percent-clipped=11.0 2023-06-27 17:23:07,637 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=11.62 vs. limit=15.0 2023-06-27 17:23:08,382 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=1858056.0, ans=0.125 2023-06-27 17:23:31,327 INFO [train.py:996] (3/4) Epoch 11, batch 4750, loss[loss=0.2089, simple_loss=0.2733, pruned_loss=0.07225, over 21579.00 frames. ], tot_loss[loss=0.211, simple_loss=0.2889, pruned_loss=0.06653, over 4283329.59 frames. 
], batch size: 263, lr: 2.70e-03, grad_scale: 16.0 2023-06-27 17:23:31,947 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1858176.0, ans=0.125 2023-06-27 17:23:37,130 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1858176.0, ans=0.125 2023-06-27 17:23:50,377 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=1858176.0, ans=0.125 2023-06-27 17:25:03,409 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1858416.0, ans=0.0 2023-06-27 17:25:03,520 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1858416.0, ans=0.0 2023-06-27 17:25:05,511 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=1858416.0, ans=0.2 2023-06-27 17:25:20,792 INFO [train.py:996] (3/4) Epoch 11, batch 4800, loss[loss=0.2041, simple_loss=0.3112, pruned_loss=0.04855, over 21786.00 frames. ], tot_loss[loss=0.2115, simple_loss=0.2885, pruned_loss=0.06729, over 4283844.33 frames. ], batch size: 332, lr: 2.70e-03, grad_scale: 32.0 2023-06-27 17:25:25,143 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1858476.0, ans=0.0 2023-06-27 17:26:03,478 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1858596.0, ans=0.1 2023-06-27 17:26:23,154 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1858596.0, ans=0.0 2023-06-27 17:26:28,654 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.577e+02 8.092e+02 1.102e+03 1.736e+03 3.587e+03, threshold=2.204e+03, percent-clipped=14.0 2023-06-27 17:26:34,925 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.01 vs. limit=22.5 2023-06-27 17:27:03,217 INFO [train.py:996] (3/4) Epoch 11, batch 4850, loss[loss=0.1964, simple_loss=0.2708, pruned_loss=0.061, over 21640.00 frames. ], tot_loss[loss=0.2114, simple_loss=0.2886, pruned_loss=0.06707, over 4282447.78 frames. ], batch size: 263, lr: 2.70e-03, grad_scale: 16.0 2023-06-27 17:27:03,792 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=1858776.0, ans=0.0 2023-06-27 17:27:12,312 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1858776.0, ans=0.1 2023-06-27 17:28:11,981 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=1858956.0, ans=0.125 2023-06-27 17:28:41,954 INFO [train.py:996] (3/4) Epoch 11, batch 4900, loss[loss=0.2056, simple_loss=0.2854, pruned_loss=0.06289, over 21872.00 frames. ], tot_loss[loss=0.2135, simple_loss=0.2912, pruned_loss=0.06787, over 4288588.92 frames. 
], batch size: 124, lr: 2.70e-03, grad_scale: 16.0 2023-06-27 17:29:03,871 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=1859076.0, ans=0.125 2023-06-27 17:29:39,608 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1859196.0, ans=0.2 2023-06-27 17:29:56,063 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.319e+02 7.657e+02 1.361e+03 1.915e+03 3.497e+03, threshold=2.723e+03, percent-clipped=17.0 2023-06-27 17:30:16,552 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=1859316.0, ans=0.0 2023-06-27 17:30:31,185 INFO [train.py:996] (3/4) Epoch 11, batch 4950, loss[loss=0.2169, simple_loss=0.3152, pruned_loss=0.05932, over 21420.00 frames. ], tot_loss[loss=0.2134, simple_loss=0.2939, pruned_loss=0.06645, over 4275913.43 frames. ], batch size: 471, lr: 2.70e-03, grad_scale: 16.0 2023-06-27 17:31:06,700 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1859436.0, ans=0.0 2023-06-27 17:31:09,091 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.18 vs. limit=10.0 2023-06-27 17:31:46,886 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=2.86 vs. limit=12.0 2023-06-27 17:32:10,264 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.57 vs. limit=15.0 2023-06-27 17:32:14,062 INFO [train.py:996] (3/4) Epoch 11, batch 5000, loss[loss=0.1881, simple_loss=0.2794, pruned_loss=0.04839, over 21854.00 frames. ], tot_loss[loss=0.2099, simple_loss=0.2931, pruned_loss=0.06338, over 4281901.31 frames. ], batch size: 282, lr: 2.70e-03, grad_scale: 16.0 2023-06-27 17:32:37,450 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=1859676.0, ans=0.125 2023-06-27 17:32:42,565 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1859736.0, ans=0.125 2023-06-27 17:33:20,249 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.265e+02 5.938e+02 8.345e+02 1.344e+03 2.733e+03, threshold=1.669e+03, percent-clipped=1.0 2023-06-27 17:33:50,174 INFO [train.py:996] (3/4) Epoch 11, batch 5050, loss[loss=0.2589, simple_loss=0.3116, pruned_loss=0.1031, over 21797.00 frames. ], tot_loss[loss=0.2113, simple_loss=0.2936, pruned_loss=0.06446, over 4288667.21 frames. ], batch size: 508, lr: 2.70e-03, grad_scale: 16.0 2023-06-27 17:33:56,789 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=1859976.0, ans=0.125 2023-06-27 17:34:37,356 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=1860096.0, ans=0.0 2023-06-27 17:35:04,884 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=12.34 vs. limit=15.0 2023-06-27 17:35:09,696 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.10 vs. 
limit=12.0 2023-06-27 17:35:33,540 INFO [train.py:996] (3/4) Epoch 11, batch 5100, loss[loss=0.1773, simple_loss=0.2592, pruned_loss=0.04768, over 21693.00 frames. ], tot_loss[loss=0.2107, simple_loss=0.2913, pruned_loss=0.06504, over 4294806.70 frames. ], batch size: 263, lr: 2.70e-03, grad_scale: 16.0 2023-06-27 17:35:43,761 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=1860276.0, ans=0.2 2023-06-27 17:35:47,391 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1860276.0, ans=0.0 2023-06-27 17:36:47,577 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.357e+02 6.699e+02 8.715e+02 1.182e+03 3.007e+03, threshold=1.743e+03, percent-clipped=11.0 2023-06-27 17:37:23,083 INFO [train.py:996] (3/4) Epoch 11, batch 5150, loss[loss=0.2334, simple_loss=0.2981, pruned_loss=0.0844, over 21352.00 frames. ], tot_loss[loss=0.2101, simple_loss=0.2884, pruned_loss=0.0659, over 4296095.77 frames. ], batch size: 144, lr: 2.70e-03, grad_scale: 16.0 2023-06-27 17:38:32,350 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1860756.0, ans=0.0 2023-06-27 17:39:12,462 INFO [train.py:996] (3/4) Epoch 11, batch 5200, loss[loss=0.1798, simple_loss=0.2597, pruned_loss=0.0499, over 19981.00 frames. ], tot_loss[loss=0.2116, simple_loss=0.2904, pruned_loss=0.06638, over 4288179.89 frames. ], batch size: 703, lr: 2.69e-03, grad_scale: 32.0 2023-06-27 17:39:29,815 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=1860876.0, ans=0.035 2023-06-27 17:39:51,615 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1860936.0, ans=0.1 2023-06-27 17:40:04,854 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1860996.0, ans=0.125 2023-06-27 17:40:17,762 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.467e+02 7.745e+02 1.179e+03 1.665e+03 4.294e+03, threshold=2.357e+03, percent-clipped=21.0 2023-06-27 17:40:58,093 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1861116.0, ans=0.0 2023-06-27 17:41:00,885 INFO [train.py:996] (3/4) Epoch 11, batch 5250, loss[loss=0.1835, simple_loss=0.2841, pruned_loss=0.04144, over 19879.00 frames. ], tot_loss[loss=0.2119, simple_loss=0.2931, pruned_loss=0.06533, over 4277988.80 frames. ], batch size: 702, lr: 2.69e-03, grad_scale: 16.0 2023-06-27 17:41:21,462 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1861236.0, ans=0.125 2023-06-27 17:41:50,246 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=1861296.0, ans=0.2 2023-06-27 17:41:51,747 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1861296.0, ans=0.125 2023-06-27 17:42:15,404 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=7.58 vs. 
limit=15.0 2023-06-27 17:42:40,311 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1861476.0, ans=0.125 2023-06-27 17:42:40,728 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=11.32 vs. limit=15.0 2023-06-27 17:42:41,256 INFO [train.py:996] (3/4) Epoch 11, batch 5300, loss[loss=0.1758, simple_loss=0.2328, pruned_loss=0.05945, over 20206.00 frames. ], tot_loss[loss=0.2112, simple_loss=0.2916, pruned_loss=0.06544, over 4280294.60 frames. ], batch size: 702, lr: 2.69e-03, grad_scale: 16.0 2023-06-27 17:43:01,175 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=1861536.0, ans=0.125 2023-06-27 17:43:28,572 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1861596.0, ans=0.125 2023-06-27 17:43:39,370 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.786e+02 7.949e+02 1.214e+03 1.979e+03 3.974e+03, threshold=2.428e+03, percent-clipped=14.0 2023-06-27 17:43:39,969 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=1861656.0, ans=0.0 2023-06-27 17:44:21,751 INFO [train.py:996] (3/4) Epoch 11, batch 5350, loss[loss=0.2111, simple_loss=0.3029, pruned_loss=0.05965, over 19908.00 frames. ], tot_loss[loss=0.2123, simple_loss=0.2908, pruned_loss=0.06696, over 4284123.43 frames. ], batch size: 702, lr: 2.69e-03, grad_scale: 16.0 2023-06-27 17:44:42,568 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1861836.0, ans=0.125 2023-06-27 17:45:58,055 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=1862016.0, ans=0.125 2023-06-27 17:46:05,870 INFO [train.py:996] (3/4) Epoch 11, batch 5400, loss[loss=0.2215, simple_loss=0.2865, pruned_loss=0.07824, over 21388.00 frames. ], tot_loss[loss=0.2134, simple_loss=0.2908, pruned_loss=0.06799, over 4282703.87 frames. ], batch size: 144, lr: 2.69e-03, grad_scale: 16.0 2023-06-27 17:46:09,009 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.63 vs. limit=6.0 2023-06-27 17:46:14,284 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=9.11 vs. limit=15.0 2023-06-27 17:46:16,834 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=1862076.0, ans=0.04949747468305833 2023-06-27 17:46:36,519 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1862136.0, ans=0.0 2023-06-27 17:47:07,284 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.410e+02 6.450e+02 1.066e+03 1.376e+03 3.123e+03, threshold=2.132e+03, percent-clipped=3.0 2023-06-27 17:47:27,820 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1862316.0, ans=0.125 2023-06-27 17:47:50,482 INFO [train.py:996] (3/4) Epoch 11, batch 5450, loss[loss=0.2382, simple_loss=0.3502, pruned_loss=0.06314, over 21410.00 frames. 
], tot_loss[loss=0.2128, simple_loss=0.2931, pruned_loss=0.06623, over 4276633.97 frames. ], batch size: 211, lr: 2.69e-03, grad_scale: 16.0 2023-06-27 17:48:16,805 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.91 vs. limit=10.0 2023-06-27 17:48:28,521 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=1862496.0, ans=0.125 2023-06-27 17:48:42,427 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=1862496.0, ans=0.125 2023-06-27 17:49:40,238 INFO [train.py:996] (3/4) Epoch 11, batch 5500, loss[loss=0.2352, simple_loss=0.3327, pruned_loss=0.06887, over 21690.00 frames. ], tot_loss[loss=0.2125, simple_loss=0.2974, pruned_loss=0.06375, over 4275674.82 frames. ], batch size: 389, lr: 2.69e-03, grad_scale: 16.0 2023-06-27 17:50:22,309 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1862796.0, ans=0.0 2023-06-27 17:50:48,883 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=1862856.0, ans=0.09899494936611666 2023-06-27 17:50:49,747 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.051e+02 7.572e+02 1.528e+03 2.313e+03 5.179e+03, threshold=3.055e+03, percent-clipped=29.0 2023-06-27 17:50:59,328 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1862856.0, ans=0.0 2023-06-27 17:51:08,226 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=2.91 vs. limit=12.0 2023-06-27 17:51:24,524 INFO [train.py:996] (3/4) Epoch 11, batch 5550, loss[loss=0.1787, simple_loss=0.2684, pruned_loss=0.04454, over 21585.00 frames. ], tot_loss[loss=0.21, simple_loss=0.2966, pruned_loss=0.06166, over 4272010.14 frames. ], batch size: 263, lr: 2.69e-03, grad_scale: 16.0 2023-06-27 17:51:33,673 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=1862976.0, ans=0.125 2023-06-27 17:51:54,440 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1863036.0, ans=0.125 2023-06-27 17:52:02,086 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1863036.0, ans=0.125 2023-06-27 17:52:21,756 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.74 vs. limit=12.0 2023-06-27 17:52:39,694 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1863156.0, ans=0.125 2023-06-27 17:53:04,453 INFO [train.py:996] (3/4) Epoch 11, batch 5600, loss[loss=0.2112, simple_loss=0.3406, pruned_loss=0.04085, over 19696.00 frames. ], tot_loss[loss=0.2066, simple_loss=0.2954, pruned_loss=0.05891, over 4276097.08 frames. 
], batch size: 703, lr: 2.69e-03, grad_scale: 32.0 2023-06-27 17:53:13,822 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=1863276.0, ans=0.2 2023-06-27 17:53:18,716 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1863276.0, ans=0.125 2023-06-27 17:53:40,883 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1863336.0, ans=0.125 2023-06-27 17:54:13,841 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.087e+02 7.294e+02 1.095e+03 1.659e+03 3.151e+03, threshold=2.190e+03, percent-clipped=1.0 2023-06-27 17:54:41,739 INFO [train.py:996] (3/4) Epoch 11, batch 5650, loss[loss=0.2245, simple_loss=0.2954, pruned_loss=0.07676, over 21509.00 frames. ], tot_loss[loss=0.2116, simple_loss=0.2997, pruned_loss=0.06176, over 4280866.26 frames. ], batch size: 131, lr: 2.69e-03, grad_scale: 32.0 2023-06-27 17:55:20,212 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=6.43 vs. limit=15.0 2023-06-27 17:56:09,939 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1863816.0, ans=0.0 2023-06-27 17:56:19,727 INFO [train.py:996] (3/4) Epoch 11, batch 5700, loss[loss=0.1865, simple_loss=0.2662, pruned_loss=0.05342, over 21362.00 frames. ], tot_loss[loss=0.2119, simple_loss=0.2979, pruned_loss=0.06293, over 4289256.42 frames. ], batch size: 131, lr: 2.69e-03, grad_scale: 16.0 2023-06-27 17:56:24,299 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.05 vs. limit=10.0 2023-06-27 17:56:47,487 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=1863936.0, ans=0.125 2023-06-27 17:57:26,490 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1864056.0, ans=0.0 2023-06-27 17:57:32,517 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.653e+02 6.609e+02 9.381e+02 1.350e+03 3.463e+03, threshold=1.876e+03, percent-clipped=9.0 2023-06-27 17:58:01,549 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1864116.0, ans=0.125 2023-06-27 17:58:13,675 INFO [train.py:996] (3/4) Epoch 11, batch 5750, loss[loss=0.1696, simple_loss=0.2501, pruned_loss=0.04461, over 21338.00 frames. ], tot_loss[loss=0.2082, simple_loss=0.2949, pruned_loss=0.06073, over 4287383.60 frames. ], batch size: 176, lr: 2.69e-03, grad_scale: 16.0 2023-06-27 17:58:14,066 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=1864176.0, ans=0.1 2023-06-27 17:58:22,638 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=1864176.0, ans=0.5 2023-06-27 17:58:25,952 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1864176.0, ans=0.1 2023-06-27 17:58:40,332 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=5.24 vs. 
limit=15.0 2023-06-27 17:58:54,827 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1864296.0, ans=0.125 2023-06-27 17:59:18,661 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=1864356.0, ans=0.2 2023-06-27 17:59:51,492 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.15 vs. limit=15.0 2023-06-27 17:59:56,965 INFO [train.py:996] (3/4) Epoch 11, batch 5800, loss[loss=0.2173, simple_loss=0.3069, pruned_loss=0.06379, over 21589.00 frames. ], tot_loss[loss=0.2072, simple_loss=0.2953, pruned_loss=0.05953, over 4287430.47 frames. ], batch size: 230, lr: 2.69e-03, grad_scale: 16.0 2023-06-27 18:00:11,509 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.24 vs. limit=15.0 2023-06-27 18:00:30,097 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.58 vs. limit=10.0 2023-06-27 18:00:56,098 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.09 vs. limit=15.0 2023-06-27 18:01:03,677 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1864656.0, ans=0.0 2023-06-27 18:01:04,533 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.579e+02 7.155e+02 1.088e+03 1.847e+03 4.141e+03, threshold=2.176e+03, percent-clipped=25.0 2023-06-27 18:01:05,461 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1864656.0, ans=0.125 2023-06-27 18:01:13,511 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1864656.0, ans=0.125 2023-06-27 18:01:41,149 INFO [train.py:996] (3/4) Epoch 11, batch 5850, loss[loss=0.1998, simple_loss=0.2771, pruned_loss=0.06129, over 21205.00 frames. ], tot_loss[loss=0.2032, simple_loss=0.2928, pruned_loss=0.0568, over 4286911.32 frames. ], batch size: 608, lr: 2.69e-03, grad_scale: 16.0 2023-06-27 18:02:01,972 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1864836.0, ans=0.125 2023-06-27 18:02:13,952 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=1864836.0, ans=0.0 2023-06-27 18:02:14,438 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.91 vs. limit=15.0 2023-06-27 18:02:23,443 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=1864896.0, ans=0.125 2023-06-27 18:02:58,455 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=1864956.0, ans=0.2 2023-06-27 18:03:17,826 INFO [train.py:996] (3/4) Epoch 11, batch 5900, loss[loss=0.1632, simple_loss=0.2494, pruned_loss=0.03849, over 21620.00 frames. ], tot_loss[loss=0.1958, simple_loss=0.286, pruned_loss=0.05279, over 4285486.36 frames. 
], batch size: 230, lr: 2.69e-03, grad_scale: 16.0 2023-06-27 18:03:35,210 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.51 vs. limit=15.0 2023-06-27 18:03:42,921 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=1865136.0, ans=0.0 2023-06-27 18:03:58,483 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=1865196.0, ans=0.2 2023-06-27 18:04:28,080 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.367e+02 6.471e+02 9.679e+02 1.352e+03 2.438e+03, threshold=1.936e+03, percent-clipped=4.0 2023-06-27 18:04:54,752 INFO [train.py:996] (3/4) Epoch 11, batch 5950, loss[loss=0.1821, simple_loss=0.249, pruned_loss=0.0576, over 21637.00 frames. ], tot_loss[loss=0.1983, simple_loss=0.2844, pruned_loss=0.05607, over 4292670.41 frames. ], batch size: 247, lr: 2.69e-03, grad_scale: 16.0 2023-06-27 18:05:00,581 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1865376.0, ans=0.1 2023-06-27 18:05:35,063 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1865496.0, ans=0.125 2023-06-27 18:05:41,439 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=1865496.0, ans=0.125 2023-06-27 18:06:25,141 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.85 vs. limit=22.5 2023-06-27 18:06:33,049 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1865616.0, ans=0.1 2023-06-27 18:06:37,207 INFO [train.py:996] (3/4) Epoch 11, batch 6000, loss[loss=0.1845, simple_loss=0.2268, pruned_loss=0.07111, over 20039.00 frames. ], tot_loss[loss=0.1975, simple_loss=0.2788, pruned_loss=0.05816, over 4285353.72 frames. ], batch size: 703, lr: 2.69e-03, grad_scale: 32.0 2023-06-27 18:06:37,207 INFO [train.py:1019] (3/4) Computing validation loss 2023-06-27 18:06:56,346 INFO [train.py:1028] (3/4) Epoch 11, validation: loss=0.2612, simple_loss=0.354, pruned_loss=0.08419, over 1796401.00 frames. 2023-06-27 18:06:56,347 INFO [train.py:1029] (3/4) Maximum memory allocated so far is 23690MB 2023-06-27 18:08:10,062 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.353e+02 5.907e+02 8.109e+02 1.325e+03 2.971e+03, threshold=1.622e+03, percent-clipped=7.0 2023-06-27 18:08:39,981 INFO [train.py:996] (3/4) Epoch 11, batch 6050, loss[loss=0.1631, simple_loss=0.2516, pruned_loss=0.03725, over 21698.00 frames. ], tot_loss[loss=0.1963, simple_loss=0.2744, pruned_loss=0.0591, over 4276651.28 frames. ], batch size: 332, lr: 2.69e-03, grad_scale: 16.0 2023-06-27 18:08:54,228 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=1865976.0, ans=0.125 2023-06-27 18:09:50,341 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.77 vs. limit=15.0 2023-06-27 18:10:17,465 INFO [train.py:996] (3/4) Epoch 11, batch 6100, loss[loss=0.2182, simple_loss=0.2903, pruned_loss=0.07304, over 21733.00 frames. 
], tot_loss[loss=0.1944, simple_loss=0.2734, pruned_loss=0.05772, over 4277475.33 frames. ], batch size: 389, lr: 2.69e-03, grad_scale: 16.0 2023-06-27 18:11:29,681 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.047e+02 7.065e+02 1.029e+03 1.365e+03 3.489e+03, threshold=2.059e+03, percent-clipped=16.0 2023-06-27 18:11:59,719 INFO [train.py:996] (3/4) Epoch 11, batch 6150, loss[loss=0.193, simple_loss=0.2674, pruned_loss=0.05933, over 21982.00 frames. ], tot_loss[loss=0.1983, simple_loss=0.2767, pruned_loss=0.05998, over 4286456.71 frames. ], batch size: 119, lr: 2.69e-03, grad_scale: 16.0 2023-06-27 18:13:19,027 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.24 vs. limit=15.0 2023-06-27 18:13:30,535 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1866816.0, ans=0.1 2023-06-27 18:13:38,540 INFO [train.py:996] (3/4) Epoch 11, batch 6200, loss[loss=0.2572, simple_loss=0.3418, pruned_loss=0.08634, over 21824.00 frames. ], tot_loss[loss=0.2008, simple_loss=0.2812, pruned_loss=0.06019, over 4280847.66 frames. ], batch size: 415, lr: 2.69e-03, grad_scale: 16.0 2023-06-27 18:13:41,158 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=1866876.0, ans=0.04949747468305833 2023-06-27 18:14:01,499 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1866936.0, ans=0.0 2023-06-27 18:14:19,912 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=1866936.0, ans=0.0 2023-06-27 18:14:23,372 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=1866996.0, ans=0.125 2023-06-27 18:14:30,493 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.70 vs. limit=15.0 2023-06-27 18:14:35,182 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=1866996.0, ans=0.2 2023-06-27 18:14:39,967 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=1867056.0, ans=0.0 2023-06-27 18:14:52,447 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.390e+02 7.354e+02 1.075e+03 1.607e+03 4.153e+03, threshold=2.150e+03, percent-clipped=10.0 2023-06-27 18:15:10,576 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=1867116.0, ans=0.125 2023-06-27 18:15:10,611 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=1867116.0, ans=0.95 2023-06-27 18:15:18,547 INFO [train.py:996] (3/4) Epoch 11, batch 6250, loss[loss=0.2149, simple_loss=0.3308, pruned_loss=0.04953, over 21227.00 frames. ], tot_loss[loss=0.2038, simple_loss=0.287, pruned_loss=0.06031, over 4279179.35 frames. ], batch size: 548, lr: 2.69e-03, grad_scale: 16.0 2023-06-27 18:16:45,759 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.56 vs. 
limit=15.0 2023-06-27 18:17:10,379 INFO [train.py:996] (3/4) Epoch 11, batch 6300, loss[loss=0.2318, simple_loss=0.2962, pruned_loss=0.08369, over 21212.00 frames. ], tot_loss[loss=0.2054, simple_loss=0.2908, pruned_loss=0.05994, over 4277559.27 frames. ], batch size: 143, lr: 2.69e-03, grad_scale: 16.0 2023-06-27 18:17:31,962 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=1867536.0, ans=0.2 2023-06-27 18:17:33,626 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=1867536.0, ans=0.2 2023-06-27 18:17:47,097 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1867596.0, ans=0.125 2023-06-27 18:17:53,921 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1867596.0, ans=0.125 2023-06-27 18:17:57,183 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=1867596.0, ans=0.125 2023-06-27 18:18:17,778 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.254e+02 6.166e+02 8.295e+02 1.136e+03 2.739e+03, threshold=1.659e+03, percent-clipped=3.0 2023-06-27 18:18:52,469 INFO [train.py:996] (3/4) Epoch 11, batch 6350, loss[loss=0.2367, simple_loss=0.3243, pruned_loss=0.07452, over 21837.00 frames. ], tot_loss[loss=0.2099, simple_loss=0.2927, pruned_loss=0.06351, over 4279102.56 frames. ], batch size: 118, lr: 2.69e-03, grad_scale: 16.0 2023-06-27 18:19:49,404 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1867896.0, ans=0.0 2023-06-27 18:19:58,222 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.28 vs. limit=6.0 2023-06-27 18:20:21,043 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1868016.0, ans=0.125 2023-06-27 18:20:40,583 INFO [train.py:996] (3/4) Epoch 11, batch 6400, loss[loss=0.2751, simple_loss=0.3446, pruned_loss=0.1028, over 21425.00 frames. ], tot_loss[loss=0.216, simple_loss=0.2976, pruned_loss=0.06721, over 4278892.65 frames. ], batch size: 471, lr: 2.69e-03, grad_scale: 32.0 2023-06-27 18:20:41,243 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1868076.0, ans=0.125 2023-06-27 18:21:25,612 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.76 vs. 
limit=15.0 2023-06-27 18:21:28,880 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1868196.0, ans=0.125 2023-06-27 18:21:55,782 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.666e+02 7.590e+02 1.060e+03 1.570e+03 3.138e+03, threshold=2.120e+03, percent-clipped=19.0 2023-06-27 18:22:14,332 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=1868316.0, ans=0.125 2023-06-27 18:22:19,031 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1868316.0, ans=0.125 2023-06-27 18:22:23,555 INFO [train.py:996] (3/4) Epoch 11, batch 6450, loss[loss=0.1771, simple_loss=0.2586, pruned_loss=0.04782, over 21222.00 frames. ], tot_loss[loss=0.2162, simple_loss=0.2998, pruned_loss=0.06632, over 4277343.15 frames. ], batch size: 176, lr: 2.69e-03, grad_scale: 16.0 2023-06-27 18:22:29,414 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.08 vs. limit=15.0 2023-06-27 18:22:40,844 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer_ff2.min_abs, batch_count=1868436.0, ans=0.1 2023-06-27 18:22:42,355 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1868436.0, ans=0.125 2023-06-27 18:23:30,245 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=1868556.0, ans=0.95 2023-06-27 18:23:37,015 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1868556.0, ans=0.1 2023-06-27 18:23:43,937 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1868616.0, ans=0.0 2023-06-27 18:23:43,964 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1868616.0, ans=0.125 2023-06-27 18:24:06,993 INFO [train.py:996] (3/4) Epoch 11, batch 6500, loss[loss=0.1775, simple_loss=0.2466, pruned_loss=0.05417, over 21550.00 frames. ], tot_loss[loss=0.212, simple_loss=0.2935, pruned_loss=0.06528, over 4276470.17 frames. ], batch size: 213, lr: 2.69e-03, grad_scale: 16.0 2023-06-27 18:24:35,038 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=1868736.0, ans=0.0 2023-06-27 18:25:01,470 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1868796.0, ans=0.0 2023-06-27 18:25:20,916 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.722e+02 7.121e+02 1.016e+03 1.758e+03 3.430e+03, threshold=2.032e+03, percent-clipped=12.0 2023-06-27 18:25:32,968 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=1868916.0, ans=0.0 2023-06-27 18:25:48,827 INFO [train.py:996] (3/4) Epoch 11, batch 6550, loss[loss=0.2066, simple_loss=0.2806, pruned_loss=0.06634, over 21428.00 frames. ], tot_loss[loss=0.2116, simple_loss=0.2939, pruned_loss=0.06468, over 4280005.70 frames. 
], batch size: 211, lr: 2.69e-03, grad_scale: 16.0 2023-06-27 18:25:57,885 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1868976.0, ans=0.125 2023-06-27 18:26:13,840 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=1869036.0, ans=0.2 2023-06-27 18:26:33,047 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-27 18:27:13,648 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1869216.0, ans=0.1 2023-06-27 18:27:13,682 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1869216.0, ans=0.1 2023-06-27 18:27:31,153 INFO [train.py:996] (3/4) Epoch 11, batch 6600, loss[loss=0.1735, simple_loss=0.2364, pruned_loss=0.05529, over 21209.00 frames. ], tot_loss[loss=0.2078, simple_loss=0.2881, pruned_loss=0.06374, over 4278977.56 frames. ], batch size: 176, lr: 2.69e-03, grad_scale: 8.0 2023-06-27 18:27:31,750 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1869276.0, ans=0.125 2023-06-27 18:27:35,234 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=1869276.0, ans=0.2 2023-06-27 18:27:36,772 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1869276.0, ans=0.125 2023-06-27 18:27:45,984 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=1869276.0, ans=0.2 2023-06-27 18:27:47,564 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=1869276.0, ans=0.2 2023-06-27 18:28:26,809 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=1869396.0, ans=0.0 2023-06-27 18:28:50,865 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.168e+02 6.592e+02 1.007e+03 1.403e+03 3.039e+03, threshold=2.014e+03, percent-clipped=10.0 2023-06-27 18:28:55,322 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.77 vs. limit=10.0 2023-06-27 18:29:05,066 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1869516.0, ans=0.1 2023-06-27 18:29:12,961 INFO [train.py:996] (3/4) Epoch 11, batch 6650, loss[loss=0.1725, simple_loss=0.2407, pruned_loss=0.05215, over 21834.00 frames. ], tot_loss[loss=0.2013, simple_loss=0.2798, pruned_loss=0.06142, over 4274422.74 frames. ], batch size: 98, lr: 2.69e-03, grad_scale: 8.0 2023-06-27 18:30:53,835 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=1869816.0, ans=0.07 2023-06-27 18:30:59,832 INFO [train.py:996] (3/4) Epoch 11, batch 6700, loss[loss=0.2522, simple_loss=0.3103, pruned_loss=0.09703, over 21474.00 frames. ], tot_loss[loss=0.1987, simple_loss=0.2747, pruned_loss=0.06133, over 4275391.69 frames. 
], batch size: 509, lr: 2.69e-03, grad_scale: 8.0 2023-06-27 18:31:34,982 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=1869936.0, ans=0.2 2023-06-27 18:31:59,036 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1870056.0, ans=0.125 2023-06-27 18:32:02,147 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=1870056.0, ans=0.2 2023-06-27 18:32:16,558 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.214e+02 6.879e+02 9.707e+02 1.410e+03 2.811e+03, threshold=1.941e+03, percent-clipped=3.0 2023-06-27 18:32:42,376 INFO [train.py:996] (3/4) Epoch 11, batch 6750, loss[loss=0.1892, simple_loss=0.2596, pruned_loss=0.05937, over 21797.00 frames. ], tot_loss[loss=0.1985, simple_loss=0.2731, pruned_loss=0.06193, over 4266631.08 frames. ], batch size: 118, lr: 2.69e-03, grad_scale: 8.0 2023-06-27 18:33:10,090 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=1870236.0, ans=0.2 2023-06-27 18:33:23,007 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1870236.0, ans=0.1 2023-06-27 18:33:40,608 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1870356.0, ans=0.0 2023-06-27 18:33:49,981 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=1870356.0, ans=0.0 2023-06-27 18:34:01,944 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1870416.0, ans=0.125 2023-06-27 18:34:23,480 INFO [train.py:996] (3/4) Epoch 11, batch 6800, loss[loss=0.2126, simple_loss=0.2766, pruned_loss=0.07433, over 21791.00 frames. ], tot_loss[loss=0.2008, simple_loss=0.2746, pruned_loss=0.06351, over 4261055.63 frames. ], batch size: 351, lr: 2.69e-03, grad_scale: 16.0 2023-06-27 18:34:30,974 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=1870476.0, ans=0.0 2023-06-27 18:34:53,404 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1870536.0, ans=0.1 2023-06-27 18:35:15,466 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-27 18:35:39,117 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.574e+02 7.159e+02 9.186e+02 1.470e+03 3.415e+03, threshold=1.837e+03, percent-clipped=10.0 2023-06-27 18:35:57,385 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=1870716.0, ans=0.125 2023-06-27 18:36:00,268 INFO [train.py:996] (3/4) Epoch 11, batch 6850, loss[loss=0.1738, simple_loss=0.2391, pruned_loss=0.05427, over 21557.00 frames. ], tot_loss[loss=0.2012, simple_loss=0.2737, pruned_loss=0.06436, over 4267222.82 frames. ], batch size: 230, lr: 2.69e-03, grad_scale: 16.0 2023-06-27 18:36:41,989 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=1870896.0, ans=0.2 2023-06-27 18:37:14,015 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.05 vs. 
limit=15.0 2023-06-27 18:37:30,554 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=1871016.0, ans=0.125 2023-06-27 18:37:43,671 INFO [train.py:996] (3/4) Epoch 11, batch 6900, loss[loss=0.2292, simple_loss=0.3044, pruned_loss=0.07701, over 21649.00 frames. ], tot_loss[loss=0.2032, simple_loss=0.2763, pruned_loss=0.06505, over 4267312.62 frames. ], batch size: 471, lr: 2.69e-03, grad_scale: 16.0 2023-06-27 18:37:46,068 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=1871076.0, ans=0.0 2023-06-27 18:38:25,014 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.10 vs. limit=12.0 2023-06-27 18:38:25,040 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.92 vs. limit=10.0 2023-06-27 18:38:54,194 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1871256.0, ans=0.0 2023-06-27 18:38:59,612 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1871256.0, ans=0.0 2023-06-27 18:39:04,689 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1871256.0, ans=0.1 2023-06-27 18:39:05,833 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.235e+02 7.048e+02 1.193e+03 1.711e+03 4.903e+03, threshold=2.385e+03, percent-clipped=22.0 2023-06-27 18:39:06,512 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=1871256.0, ans=0.2 2023-06-27 18:39:11,312 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1871316.0, ans=0.1 2023-06-27 18:39:18,809 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=1871316.0, ans=0.2 2023-06-27 18:39:25,778 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=1871316.0, ans=0.125 2023-06-27 18:39:31,804 INFO [train.py:996] (3/4) Epoch 11, batch 6950, loss[loss=0.2943, simple_loss=0.3493, pruned_loss=0.1197, over 21305.00 frames. ], tot_loss[loss=0.2036, simple_loss=0.2801, pruned_loss=0.06361, over 4271661.07 frames. ], batch size: 507, lr: 2.69e-03, grad_scale: 16.0 2023-06-27 18:40:38,119 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.78 vs. limit=15.0 2023-06-27 18:40:45,879 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=1871556.0, ans=0.125 2023-06-27 18:40:47,459 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=1871556.0, ans=0.04949747468305833 2023-06-27 18:40:50,810 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1871556.0, ans=0.0 2023-06-27 18:41:14,926 INFO [train.py:996] (3/4) Epoch 11, batch 7000, loss[loss=0.2142, simple_loss=0.288, pruned_loss=0.07024, over 21353.00 frames. ], tot_loss[loss=0.2068, simple_loss=0.2828, pruned_loss=0.06542, over 4274196.72 frames. 
], batch size: 131, lr: 2.69e-03, grad_scale: 16.0 2023-06-27 18:41:50,586 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=1871736.0, ans=0.0 2023-06-27 18:42:11,292 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten.whitening_limit, batch_count=1871796.0, ans=15.0 2023-06-27 18:42:26,040 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1871856.0, ans=0.0 2023-06-27 18:42:27,718 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1871856.0, ans=0.125 2023-06-27 18:42:31,105 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=1871856.0, ans=0.125 2023-06-27 18:42:31,979 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.526e+02 6.963e+02 9.301e+02 1.305e+03 2.856e+03, threshold=1.860e+03, percent-clipped=1.0 2023-06-27 18:42:32,487 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1871856.0, ans=0.125 2023-06-27 18:42:58,607 INFO [train.py:996] (3/4) Epoch 11, batch 7050, loss[loss=0.1884, simple_loss=0.2944, pruned_loss=0.04119, over 21270.00 frames. ], tot_loss[loss=0.204, simple_loss=0.2798, pruned_loss=0.0641, over 4268717.91 frames. ], batch size: 548, lr: 2.69e-03, grad_scale: 16.0 2023-06-27 18:43:35,377 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=1872036.0, ans=0.0 2023-06-27 18:44:47,738 INFO [train.py:996] (3/4) Epoch 11, batch 7100, loss[loss=0.1574, simple_loss=0.2355, pruned_loss=0.03968, over 21475.00 frames. ], tot_loss[loss=0.2078, simple_loss=0.2844, pruned_loss=0.06559, over 4272180.48 frames. ], batch size: 211, lr: 2.69e-03, grad_scale: 16.0 2023-06-27 18:45:05,899 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=1872276.0, ans=10.0 2023-06-27 18:45:05,903 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=1872276.0, ans=0.125 2023-06-27 18:45:06,406 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.13 vs. limit=15.0 2023-06-27 18:45:46,270 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=1872456.0, ans=10.0 2023-06-27 18:46:00,899 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.34 vs. limit=15.0 2023-06-27 18:46:03,434 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.854e+02 6.087e+02 7.876e+02 1.187e+03 3.248e+03, threshold=1.575e+03, percent-clipped=9.0 2023-06-27 18:46:25,749 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=1872516.0, ans=0.125 2023-06-27 18:46:27,534 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1872516.0, ans=0.0 2023-06-27 18:46:30,033 INFO [train.py:996] (3/4) Epoch 11, batch 7150, loss[loss=0.228, simple_loss=0.3053, pruned_loss=0.07539, over 21978.00 frames. 
], tot_loss[loss=0.2049, simple_loss=0.2822, pruned_loss=0.06384, over 4268634.98 frames. ], batch size: 317, lr: 2.69e-03, grad_scale: 16.0 2023-06-27 18:47:21,170 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=1872696.0, ans=0.0 2023-06-27 18:47:34,281 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1872756.0, ans=0.125 2023-06-27 18:48:18,351 INFO [train.py:996] (3/4) Epoch 11, batch 7200, loss[loss=0.197, simple_loss=0.2644, pruned_loss=0.06482, over 21225.00 frames. ], tot_loss[loss=0.2067, simple_loss=0.2837, pruned_loss=0.06488, over 4266777.02 frames. ], batch size: 176, lr: 2.69e-03, grad_scale: 32.0 2023-06-27 18:48:47,965 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.87 vs. limit=15.0 2023-06-27 18:49:21,348 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1873056.0, ans=0.125 2023-06-27 18:49:35,427 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.391e+02 8.685e+02 1.394e+03 1.830e+03 3.525e+03, threshold=2.788e+03, percent-clipped=36.0 2023-06-27 18:49:44,174 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1873116.0, ans=0.0 2023-06-27 18:50:04,640 INFO [train.py:996] (3/4) Epoch 11, batch 7250, loss[loss=0.1773, simple_loss=0.242, pruned_loss=0.0563, over 21525.00 frames. ], tot_loss[loss=0.2051, simple_loss=0.2801, pruned_loss=0.06504, over 4256800.14 frames. ], batch size: 230, lr: 2.69e-03, grad_scale: 16.0 2023-06-27 18:50:05,400 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=1873176.0, ans=0.0 2023-06-27 18:50:16,961 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=1873176.0, ans=0.0 2023-06-27 18:50:18,610 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-27 18:51:23,178 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=1873416.0, ans=0.2 2023-06-27 18:51:47,392 INFO [train.py:996] (3/4) Epoch 11, batch 7300, loss[loss=0.1991, simple_loss=0.2661, pruned_loss=0.06607, over 21850.00 frames. ], tot_loss[loss=0.2011, simple_loss=0.2743, pruned_loss=0.06395, over 4267206.92 frames. 
], batch size: 98, lr: 2.69e-03, grad_scale: 16.0 2023-06-27 18:51:48,196 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1873476.0, ans=0.125 2023-06-27 18:51:57,959 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=1873476.0, ans=0.125 2023-06-27 18:52:36,810 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=1873596.0, ans=0.125 2023-06-27 18:53:00,322 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.280e+02 7.298e+02 1.227e+03 1.780e+03 3.301e+03, threshold=2.454e+03, percent-clipped=5.0 2023-06-27 18:53:00,934 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=1873656.0, ans=0.125 2023-06-27 18:53:29,135 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1873776.0, ans=0.0 2023-06-27 18:53:29,812 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=5.91 vs. limit=15.0 2023-06-27 18:53:30,257 INFO [train.py:996] (3/4) Epoch 11, batch 7350, loss[loss=0.2412, simple_loss=0.322, pruned_loss=0.08015, over 21495.00 frames. ], tot_loss[loss=0.202, simple_loss=0.2741, pruned_loss=0.06498, over 4267302.93 frames. ], batch size: 131, lr: 2.69e-03, grad_scale: 16.0 2023-06-27 18:53:34,088 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1873776.0, ans=0.125 2023-06-27 18:53:52,757 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-27 18:53:54,593 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1873836.0, ans=0.125 2023-06-27 18:54:32,779 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1873956.0, ans=0.125 2023-06-27 18:54:54,743 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=1874016.0, ans=0.2 2023-06-27 18:55:13,770 INFO [train.py:996] (3/4) Epoch 11, batch 7400, loss[loss=0.2043, simple_loss=0.3056, pruned_loss=0.0515, over 21838.00 frames. ], tot_loss[loss=0.2064, simple_loss=0.2793, pruned_loss=0.0667, over 4267781.77 frames. 
], batch size: 372, lr: 2.69e-03, grad_scale: 16.0 2023-06-27 18:55:29,835 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1874136.0, ans=0.125 2023-06-27 18:55:31,382 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1874136.0, ans=0.0 2023-06-27 18:55:47,407 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1874136.0, ans=0.125 2023-06-27 18:55:54,088 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1874196.0, ans=0.1 2023-06-27 18:56:08,879 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=1874196.0, ans=0.0 2023-06-27 18:56:31,615 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.427e+02 7.089e+02 1.051e+03 1.718e+03 3.603e+03, threshold=2.102e+03, percent-clipped=3.0 2023-06-27 18:56:57,308 INFO [train.py:996] (3/4) Epoch 11, batch 7450, loss[loss=0.1662, simple_loss=0.2097, pruned_loss=0.06132, over 20076.00 frames. ], tot_loss[loss=0.2047, simple_loss=0.2775, pruned_loss=0.06599, over 4272918.17 frames. ], batch size: 702, lr: 2.69e-03, grad_scale: 16.0 2023-06-27 18:57:09,973 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1874376.0, ans=0.125 2023-06-27 18:57:16,488 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1874436.0, ans=0.125 2023-06-27 18:57:51,720 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=1874496.0, ans=0.0 2023-06-27 18:57:53,155 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=1874496.0, ans=0.125 2023-06-27 18:58:05,084 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=1874556.0, ans=0.125 2023-06-27 18:58:18,680 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1874556.0, ans=0.0 2023-06-27 18:58:25,346 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=1874616.0, ans=0.04949747468305833 2023-06-27 18:58:41,418 INFO [train.py:996] (3/4) Epoch 11, batch 7500, loss[loss=0.178, simple_loss=0.2362, pruned_loss=0.05994, over 20836.00 frames. ], tot_loss[loss=0.2083, simple_loss=0.2821, pruned_loss=0.0672, over 4268372.28 frames. 
], batch size: 608, lr: 2.68e-03, grad_scale: 16.0 2023-06-27 18:59:27,306 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1874796.0, ans=0.125 2023-06-27 18:59:28,906 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=1874796.0, ans=0.2 2023-06-27 18:59:48,723 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1874856.0, ans=0.1 2023-06-27 19:00:04,710 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.389e+02 7.977e+02 1.325e+03 1.991e+03 3.400e+03, threshold=2.650e+03, percent-clipped=21.0 2023-06-27 19:00:24,549 INFO [train.py:996] (3/4) Epoch 11, batch 7550, loss[loss=0.1916, simple_loss=0.261, pruned_loss=0.06111, over 21218.00 frames. ], tot_loss[loss=0.2121, simple_loss=0.2907, pruned_loss=0.06678, over 4266837.27 frames. ], batch size: 608, lr: 2.68e-03, grad_scale: 16.0 2023-06-27 19:01:55,061 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=1875216.0, ans=0.2 2023-06-27 19:02:04,496 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=1875276.0, ans=10.0 2023-06-27 19:02:05,493 INFO [train.py:996] (3/4) Epoch 11, batch 7600, loss[loss=0.22, simple_loss=0.2906, pruned_loss=0.07469, over 21391.00 frames. ], tot_loss[loss=0.2114, simple_loss=0.2907, pruned_loss=0.06606, over 4276048.38 frames. ], batch size: 159, lr: 2.68e-03, grad_scale: 32.0 2023-06-27 19:02:14,567 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1875276.0, ans=0.0 2023-06-27 19:02:18,255 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.57 vs. limit=15.0 2023-06-27 19:02:19,923 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=8.26 vs. limit=15.0 2023-06-27 19:02:57,064 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1875396.0, ans=0.125 2023-06-27 19:03:11,271 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1875456.0, ans=0.0 2023-06-27 19:03:28,851 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.792e+02 7.250e+02 9.858e+02 1.337e+03 3.374e+03, threshold=1.972e+03, percent-clipped=5.0 2023-06-27 19:03:42,667 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1875516.0, ans=0.1 2023-06-27 19:03:47,205 INFO [train.py:996] (3/4) Epoch 11, batch 7650, loss[loss=0.2102, simple_loss=0.2854, pruned_loss=0.06747, over 21485.00 frames. ], tot_loss[loss=0.2119, simple_loss=0.2897, pruned_loss=0.06708, over 4287542.64 frames. ], batch size: 131, lr: 2.68e-03, grad_scale: 8.0 2023-06-27 19:05:06,525 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=1875756.0, ans=0.0 2023-06-27 19:05:30,396 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.80 vs. 
limit=15.0 2023-06-27 19:05:30,747 INFO [train.py:996] (3/4) Epoch 11, batch 7700, loss[loss=0.1714, simple_loss=0.2291, pruned_loss=0.05681, over 20779.00 frames. ], tot_loss[loss=0.2157, simple_loss=0.293, pruned_loss=0.06918, over 4288816.56 frames. ], batch size: 608, lr: 2.68e-03, grad_scale: 8.0 2023-06-27 19:05:31,516 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1875876.0, ans=0.1 2023-06-27 19:06:53,504 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=1876056.0, ans=0.125 2023-06-27 19:06:59,792 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.712e+02 8.160e+02 1.175e+03 1.754e+03 4.757e+03, threshold=2.350e+03, percent-clipped=23.0 2023-06-27 19:07:00,774 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1876116.0, ans=0.125 2023-06-27 19:07:09,141 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1876116.0, ans=0.1 2023-06-27 19:07:10,783 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1876116.0, ans=0.125 2023-06-27 19:07:16,808 INFO [train.py:996] (3/4) Epoch 11, batch 7750, loss[loss=0.2663, simple_loss=0.3824, pruned_loss=0.07511, over 21198.00 frames. ], tot_loss[loss=0.2197, simple_loss=0.3003, pruned_loss=0.06953, over 4285759.85 frames. ], batch size: 548, lr: 2.68e-03, grad_scale: 8.0 2023-06-27 19:07:37,480 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1876176.0, ans=0.125 2023-06-27 19:07:44,323 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=1876236.0, ans=0.0 2023-06-27 19:08:03,022 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1876236.0, ans=0.1 2023-06-27 19:08:11,680 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.04 vs. limit=12.0 2023-06-27 19:08:26,579 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1876356.0, ans=0.1 2023-06-27 19:08:34,921 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=1876356.0, ans=0.125 2023-06-27 19:08:56,458 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1876416.0, ans=0.125 2023-06-27 19:09:10,455 INFO [train.py:996] (3/4) Epoch 11, batch 7800, loss[loss=0.2151, simple_loss=0.3021, pruned_loss=0.06404, over 21748.00 frames. ], tot_loss[loss=0.2216, simple_loss=0.3022, pruned_loss=0.07052, over 4282168.18 frames. 
], batch size: 391, lr: 2.68e-03, grad_scale: 8.0 2023-06-27 19:09:41,029 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=1876536.0, ans=0.2 2023-06-27 19:10:05,496 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=1876596.0, ans=0.125 2023-06-27 19:10:07,371 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1876596.0, ans=0.0 2023-06-27 19:10:26,679 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.510e+02 6.767e+02 1.181e+03 1.586e+03 4.451e+03, threshold=2.363e+03, percent-clipped=7.0 2023-06-27 19:10:53,757 INFO [train.py:996] (3/4) Epoch 11, batch 7850, loss[loss=0.1728, simple_loss=0.2237, pruned_loss=0.06097, over 20760.00 frames. ], tot_loss[loss=0.2164, simple_loss=0.2939, pruned_loss=0.06942, over 4273053.78 frames. ], batch size: 609, lr: 2.68e-03, grad_scale: 8.0 2023-06-27 19:12:03,618 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.66 vs. limit=6.0 2023-06-27 19:12:08,007 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1876956.0, ans=0.0 2023-06-27 19:12:40,429 INFO [train.py:996] (3/4) Epoch 11, batch 7900, loss[loss=0.1757, simple_loss=0.2379, pruned_loss=0.05682, over 21431.00 frames. ], tot_loss[loss=0.2114, simple_loss=0.287, pruned_loss=0.06789, over 4261289.40 frames. ], batch size: 212, lr: 2.68e-03, grad_scale: 8.0 2023-06-27 19:12:59,163 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=1877076.0, ans=0.2 2023-06-27 19:13:22,597 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=1877196.0, ans=0.0 2023-06-27 19:13:26,943 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.94 vs. limit=15.0 2023-06-27 19:13:31,435 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1877196.0, ans=0.125 2023-06-27 19:14:01,663 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1877256.0, ans=0.1 2023-06-27 19:14:08,128 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.410e+02 7.562e+02 1.142e+03 1.795e+03 4.843e+03, threshold=2.283e+03, percent-clipped=15.0 2023-06-27 19:14:29,969 INFO [train.py:996] (3/4) Epoch 11, batch 7950, loss[loss=0.2279, simple_loss=0.312, pruned_loss=0.07193, over 21881.00 frames. ], tot_loss[loss=0.2146, simple_loss=0.2924, pruned_loss=0.06837, over 4256343.77 frames. ], batch size: 316, lr: 2.68e-03, grad_scale: 8.0 2023-06-27 19:15:15,196 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=1877496.0, ans=0.0 2023-06-27 19:16:01,742 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1877616.0, ans=0.1 2023-06-27 19:16:22,060 INFO [train.py:996] (3/4) Epoch 11, batch 8000, loss[loss=0.2587, simple_loss=0.3403, pruned_loss=0.08852, over 21445.00 frames. ], tot_loss[loss=0.2179, simple_loss=0.2962, pruned_loss=0.06974, over 4259130.44 frames. 
], batch size: 471, lr: 2.68e-03, grad_scale: 16.0 2023-06-27 19:16:28,096 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=1877676.0, ans=0.07 2023-06-27 19:16:38,714 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=1877736.0, ans=0.125 2023-06-27 19:17:26,826 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1877796.0, ans=0.0 2023-06-27 19:17:29,076 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.81 vs. limit=15.0 2023-06-27 19:17:51,522 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.031e+02 6.364e+02 9.395e+02 1.417e+03 3.378e+03, threshold=1.879e+03, percent-clipped=5.0 2023-06-27 19:17:54,502 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.22 vs. limit=15.0 2023-06-27 19:18:08,685 INFO [train.py:996] (3/4) Epoch 11, batch 8050, loss[loss=0.3001, simple_loss=0.3848, pruned_loss=0.1076, over 21561.00 frames. ], tot_loss[loss=0.2203, simple_loss=0.2999, pruned_loss=0.07037, over 4261285.28 frames. ], batch size: 471, lr: 2.68e-03, grad_scale: 16.0 2023-06-27 19:18:23,306 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-27 19:18:40,833 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=1878036.0, ans=0.2 2023-06-27 19:19:46,471 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1878216.0, ans=0.125 2023-06-27 19:19:53,006 INFO [train.py:996] (3/4) Epoch 11, batch 8100, loss[loss=0.2593, simple_loss=0.3134, pruned_loss=0.1026, over 21699.00 frames. ], tot_loss[loss=0.22, simple_loss=0.2987, pruned_loss=0.07069, over 4268434.14 frames. ], batch size: 507, lr: 2.68e-03, grad_scale: 16.0 2023-06-27 19:20:18,666 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1878336.0, ans=0.125 2023-06-27 19:20:27,495 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1878336.0, ans=0.125 2023-06-27 19:20:48,125 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1878396.0, ans=0.125 2023-06-27 19:21:22,429 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.025e+02 8.290e+02 1.329e+03 2.139e+03 5.514e+03, threshold=2.658e+03, percent-clipped=35.0 2023-06-27 19:21:48,876 INFO [train.py:996] (3/4) Epoch 11, batch 8150, loss[loss=0.255, simple_loss=0.3689, pruned_loss=0.07054, over 21165.00 frames. ], tot_loss[loss=0.2246, simple_loss=0.3052, pruned_loss=0.072, over 4260462.77 frames. ], batch size: 548, lr: 2.68e-03, grad_scale: 16.0 2023-06-27 19:21:52,771 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=1878576.0, ans=0.2 2023-06-27 19:22:52,335 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.95 vs. 
limit=15.0 2023-06-27 19:23:31,171 INFO [train.py:996] (3/4) Epoch 11, batch 8200, loss[loss=0.1917, simple_loss=0.249, pruned_loss=0.06722, over 21117.00 frames. ], tot_loss[loss=0.2188, simple_loss=0.2983, pruned_loss=0.06966, over 4264247.95 frames. ], batch size: 143, lr: 2.68e-03, grad_scale: 16.0 2023-06-27 19:24:04,160 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=5.57 vs. limit=12.0 2023-06-27 19:24:38,086 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.16 vs. limit=12.0 2023-06-27 19:24:53,468 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.441e+02 7.151e+02 1.119e+03 1.525e+03 4.860e+03, threshold=2.239e+03, percent-clipped=3.0 2023-06-27 19:25:04,148 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-27 19:25:07,408 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1879116.0, ans=0.125 2023-06-27 19:25:15,172 INFO [train.py:996] (3/4) Epoch 11, batch 8250, loss[loss=0.2707, simple_loss=0.3624, pruned_loss=0.08953, over 21624.00 frames. ], tot_loss[loss=0.2181, simple_loss=0.2972, pruned_loss=0.0695, over 4265907.72 frames. ], batch size: 441, lr: 2.68e-03, grad_scale: 16.0 2023-06-27 19:25:27,507 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=1879176.0, ans=0.04949747468305833 2023-06-27 19:25:39,502 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1879236.0, ans=0.125 2023-06-27 19:26:12,872 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1879296.0, ans=0.1 2023-06-27 19:26:34,588 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1879356.0, ans=0.0 2023-06-27 19:26:45,424 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.62 vs. limit=15.0 2023-06-27 19:26:56,586 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=1879416.0, ans=0.0 2023-06-27 19:26:59,233 INFO [train.py:996] (3/4) Epoch 11, batch 8300, loss[loss=0.2113, simple_loss=0.2958, pruned_loss=0.06338, over 21631.00 frames. ], tot_loss[loss=0.2146, simple_loss=0.295, pruned_loss=0.06707, over 4270405.65 frames. ], batch size: 247, lr: 2.68e-03, grad_scale: 16.0 2023-06-27 19:27:56,053 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=1879596.0, ans=0.1 2023-06-27 19:28:08,702 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.22 vs. 
limit=22.5 2023-06-27 19:28:25,636 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.537e+02 6.833e+02 1.058e+03 1.562e+03 3.226e+03, threshold=2.116e+03, percent-clipped=10.0 2023-06-27 19:28:30,914 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1879716.0, ans=0.125 2023-06-27 19:28:41,967 INFO [train.py:996] (3/4) Epoch 11, batch 8350, loss[loss=0.222, simple_loss=0.2903, pruned_loss=0.07686, over 21746.00 frames. ], tot_loss[loss=0.2129, simple_loss=0.2943, pruned_loss=0.06571, over 4265664.97 frames. ], batch size: 102, lr: 2.68e-03, grad_scale: 16.0 2023-06-27 19:28:44,239 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1879776.0, ans=0.125 2023-06-27 19:29:13,754 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=1879836.0, ans=0.0 2023-06-27 19:30:29,632 INFO [train.py:996] (3/4) Epoch 11, batch 8400, loss[loss=0.2146, simple_loss=0.2944, pruned_loss=0.06743, over 20784.00 frames. ], tot_loss[loss=0.2086, simple_loss=0.2912, pruned_loss=0.06302, over 4268728.25 frames. ], batch size: 608, lr: 2.68e-03, grad_scale: 32.0 2023-06-27 19:30:32,851 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.81 vs. limit=6.0 2023-06-27 19:30:51,843 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=1880136.0, ans=0.125 2023-06-27 19:31:21,498 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=1880196.0, ans=0.015 2023-06-27 19:31:37,169 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=1880256.0, ans=0.1 2023-06-27 19:31:51,130 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.798e+02 6.790e+02 1.029e+03 1.707e+03 4.211e+03, threshold=2.059e+03, percent-clipped=16.0 2023-06-27 19:32:11,260 INFO [train.py:996] (3/4) Epoch 11, batch 8450, loss[loss=0.1907, simple_loss=0.2754, pruned_loss=0.053, over 21503.00 frames. ], tot_loss[loss=0.2065, simple_loss=0.2894, pruned_loss=0.06176, over 4275207.92 frames. ], batch size: 212, lr: 2.68e-03, grad_scale: 16.0 2023-06-27 19:33:01,392 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=9.73 vs. limit=15.0 2023-06-27 19:33:48,563 INFO [train.py:996] (3/4) Epoch 11, batch 8500, loss[loss=0.2251, simple_loss=0.2936, pruned_loss=0.07827, over 21264.00 frames. ], tot_loss[loss=0.205, simple_loss=0.2848, pruned_loss=0.06257, over 4275705.94 frames. ], batch size: 159, lr: 2.68e-03, grad_scale: 16.0 2023-06-27 19:34:10,899 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1880676.0, ans=0.125 2023-06-27 19:34:53,808 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.56 vs. 
limit=22.5 2023-06-27 19:35:06,877 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1880856.0, ans=0.125 2023-06-27 19:35:17,400 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.145e+02 8.155e+02 1.098e+03 1.780e+03 3.950e+03, threshold=2.195e+03, percent-clipped=18.0 2023-06-27 19:35:23,612 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=6.87 vs. limit=15.0 2023-06-27 19:35:27,224 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.11 vs. limit=6.0 2023-06-27 19:35:28,178 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1880916.0, ans=0.125 2023-06-27 19:35:37,570 INFO [train.py:996] (3/4) Epoch 11, batch 8550, loss[loss=0.2247, simple_loss=0.3302, pruned_loss=0.05956, over 20686.00 frames. ], tot_loss[loss=0.2087, simple_loss=0.2882, pruned_loss=0.06459, over 4279811.62 frames. ], batch size: 607, lr: 2.68e-03, grad_scale: 16.0 2023-06-27 19:36:08,969 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1881036.0, ans=0.125 2023-06-27 19:37:13,048 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1881216.0, ans=0.125 2023-06-27 19:37:26,611 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=1881276.0, ans=0.125 2023-06-27 19:37:27,692 INFO [train.py:996] (3/4) Epoch 11, batch 8600, loss[loss=0.2605, simple_loss=0.3356, pruned_loss=0.09269, over 21536.00 frames. ], tot_loss[loss=0.2143, simple_loss=0.2947, pruned_loss=0.06694, over 4280587.76 frames. ], batch size: 414, lr: 2.68e-03, grad_scale: 16.0 2023-06-27 19:38:02,356 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=1881336.0, ans=0.0 2023-06-27 19:38:25,808 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.80 vs. limit=6.0 2023-06-27 19:38:35,155 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=1881456.0, ans=0.2 2023-06-27 19:38:50,813 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.911e+02 7.000e+02 1.009e+03 1.607e+03 3.888e+03, threshold=2.018e+03, percent-clipped=13.0 2023-06-27 19:39:01,836 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=1881516.0, ans=0.125 2023-06-27 19:39:11,182 INFO [train.py:996] (3/4) Epoch 11, batch 8650, loss[loss=0.143, simple_loss=0.2026, pruned_loss=0.0417, over 16541.00 frames. ], tot_loss[loss=0.2173, simple_loss=0.2998, pruned_loss=0.06744, over 4276974.41 frames. 
], batch size: 60, lr: 2.68e-03, grad_scale: 16.0 2023-06-27 19:39:26,657 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1881636.0, ans=0.1 2023-06-27 19:39:26,700 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1881636.0, ans=0.0 2023-06-27 19:40:07,431 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=1881696.0, ans=0.125 2023-06-27 19:40:43,212 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1881816.0, ans=0.125 2023-06-27 19:40:52,492 INFO [train.py:996] (3/4) Epoch 11, batch 8700, loss[loss=0.193, simple_loss=0.267, pruned_loss=0.05953, over 21443.00 frames. ], tot_loss[loss=0.2133, simple_loss=0.2951, pruned_loss=0.06573, over 4274444.82 frames. ], batch size: 389, lr: 2.68e-03, grad_scale: 16.0 2023-06-27 19:41:01,459 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1881876.0, ans=0.125 2023-06-27 19:41:07,959 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1881936.0, ans=0.125 2023-06-27 19:41:41,335 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=1881996.0, ans=0.2 2023-06-27 19:41:54,451 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-27 19:41:59,508 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1882056.0, ans=0.1 2023-06-27 19:42:15,392 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.345e+02 6.737e+02 1.063e+03 1.710e+03 3.619e+03, threshold=2.126e+03, percent-clipped=15.0 2023-06-27 19:42:35,712 INFO [train.py:996] (3/4) Epoch 11, batch 8750, loss[loss=0.2148, simple_loss=0.2872, pruned_loss=0.07113, over 21249.00 frames. ], tot_loss[loss=0.2122, simple_loss=0.2921, pruned_loss=0.06616, over 4274223.95 frames. ], batch size: 176, lr: 2.68e-03, grad_scale: 16.0 2023-06-27 19:42:48,823 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=9.60 vs. 
limit=15.0 2023-06-27 19:43:04,720 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=1882236.0, ans=0.2 2023-06-27 19:43:04,751 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1882236.0, ans=0.0 2023-06-27 19:43:06,221 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1882236.0, ans=0.125 2023-06-27 19:43:13,277 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1882296.0, ans=0.125 2023-06-27 19:43:19,574 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=1882296.0, ans=0.2 2023-06-27 19:44:16,639 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=1882416.0, ans=0.125 2023-06-27 19:44:19,329 INFO [train.py:996] (3/4) Epoch 11, batch 8800, loss[loss=0.29, simple_loss=0.3644, pruned_loss=0.1078, over 21442.00 frames. ], tot_loss[loss=0.2178, simple_loss=0.2999, pruned_loss=0.06787, over 4278561.76 frames. ], batch size: 471, lr: 2.68e-03, grad_scale: 32.0 2023-06-27 19:45:49,242 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.680e+02 9.134e+02 1.413e+03 2.470e+03 4.738e+03, threshold=2.826e+03, percent-clipped=30.0 2023-06-27 19:45:54,770 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1882716.0, ans=0.1 2023-06-27 19:46:02,353 INFO [train.py:996] (3/4) Epoch 11, batch 8850, loss[loss=0.214, simple_loss=0.303, pruned_loss=0.06252, over 21158.00 frames. ], tot_loss[loss=0.2222, simple_loss=0.3052, pruned_loss=0.06958, over 4278605.10 frames. ], batch size: 143, lr: 2.68e-03, grad_scale: 16.0 2023-06-27 19:46:19,275 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=1882776.0, ans=0.95 2023-06-27 19:46:42,274 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1882836.0, ans=0.1 2023-06-27 19:46:42,337 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=1882836.0, ans=0.125 2023-06-27 19:47:15,466 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1882956.0, ans=0.1 2023-06-27 19:47:18,844 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=1882956.0, ans=0.1 2023-06-27 19:47:36,580 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=7.65 vs. limit=15.0 2023-06-27 19:47:50,833 INFO [train.py:996] (3/4) Epoch 11, batch 8900, loss[loss=0.1981, simple_loss=0.2875, pruned_loss=0.05439, over 21860.00 frames. ], tot_loss[loss=0.2176, simple_loss=0.2991, pruned_loss=0.06802, over 4267851.88 frames. 
], batch size: 372, lr: 2.68e-03, grad_scale: 16.0 2023-06-27 19:47:56,420 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1883076.0, ans=0.1 2023-06-27 19:47:57,024 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=12.83 vs. limit=15.0 2023-06-27 19:48:45,834 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1883196.0, ans=0.125 2023-06-27 19:49:03,187 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=1883256.0, ans=0.1 2023-06-27 19:49:23,139 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.611e+02 6.392e+02 1.039e+03 1.753e+03 5.076e+03, threshold=2.078e+03, percent-clipped=8.0 2023-06-27 19:49:36,289 INFO [train.py:996] (3/4) Epoch 11, batch 8950, loss[loss=0.2243, simple_loss=0.314, pruned_loss=0.06736, over 21717.00 frames. ], tot_loss[loss=0.216, simple_loss=0.2973, pruned_loss=0.06736, over 4272235.66 frames. ], batch size: 351, lr: 2.68e-03, grad_scale: 16.0 2023-06-27 19:49:50,071 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-27 19:50:21,831 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.74 vs. limit=15.0 2023-06-27 19:50:23,224 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1883496.0, ans=0.1 2023-06-27 19:51:17,430 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=1883676.0, ans=0.0 2023-06-27 19:51:18,632 INFO [train.py:996] (3/4) Epoch 11, batch 9000, loss[loss=0.1921, simple_loss=0.2598, pruned_loss=0.06221, over 21704.00 frames. ], tot_loss[loss=0.2141, simple_loss=0.293, pruned_loss=0.06762, over 4273059.78 frames. ], batch size: 333, lr: 2.68e-03, grad_scale: 16.0 2023-06-27 19:51:18,632 INFO [train.py:1019] (3/4) Computing validation loss 2023-06-27 19:51:37,891 INFO [train.py:1028] (3/4) Epoch 11, validation: loss=0.2621, simple_loss=0.3543, pruned_loss=0.08494, over 1796401.00 frames. 2023-06-27 19:51:37,892 INFO [train.py:1029] (3/4) Maximum memory allocated so far is 23690MB 2023-06-27 19:51:47,379 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.max_abs, batch_count=1883676.0, ans=10.0 2023-06-27 19:52:04,584 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=1883736.0, ans=0.2 2023-06-27 19:52:44,458 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1883856.0, ans=0.0 2023-06-27 19:52:55,047 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=1883856.0, ans=0.125 2023-06-27 19:53:00,529 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.70 vs. 
limit=22.5 2023-06-27 19:53:04,554 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.323e+02 6.298e+02 8.263e+02 1.367e+03 3.761e+03, threshold=1.653e+03, percent-clipped=12.0 2023-06-27 19:53:28,428 INFO [train.py:996] (3/4) Epoch 11, batch 9050, loss[loss=0.2344, simple_loss=0.3107, pruned_loss=0.07906, over 21760.00 frames. ], tot_loss[loss=0.2099, simple_loss=0.2904, pruned_loss=0.06475, over 4268826.24 frames. ], batch size: 441, lr: 2.68e-03, grad_scale: 16.0 2023-06-27 19:53:59,010 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.63 vs. limit=15.0 2023-06-27 19:55:06,930 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1884216.0, ans=0.0 2023-06-27 19:55:13,467 INFO [train.py:996] (3/4) Epoch 11, batch 9100, loss[loss=0.1998, simple_loss=0.3003, pruned_loss=0.04963, over 21899.00 frames. ], tot_loss[loss=0.2156, simple_loss=0.2967, pruned_loss=0.06728, over 4273109.23 frames. ], batch size: 317, lr: 2.68e-03, grad_scale: 16.0 2023-06-27 19:55:33,184 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1884276.0, ans=0.125 2023-06-27 19:55:38,112 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=1884336.0, ans=0.0 2023-06-27 19:55:56,324 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=1884396.0, ans=0.125 2023-06-27 19:56:08,385 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1884396.0, ans=0.125 2023-06-27 19:56:33,116 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.whiten.whitening_limit, batch_count=1884456.0, ans=15.0 2023-06-27 19:56:33,116 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=6.20 vs. limit=15.0 2023-06-27 19:56:38,908 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1884456.0, ans=0.0 2023-06-27 19:56:44,830 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.255e+02 7.145e+02 1.042e+03 1.570e+03 3.461e+03, threshold=2.085e+03, percent-clipped=19.0 2023-06-27 19:56:45,515 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=1884516.0, ans=0.0 2023-06-27 19:57:03,243 INFO [train.py:996] (3/4) Epoch 11, batch 9150, loss[loss=0.2931, simple_loss=0.3784, pruned_loss=0.1039, over 21518.00 frames. ], tot_loss[loss=0.2143, simple_loss=0.2982, pruned_loss=0.06526, over 4270897.53 frames. ], batch size: 471, lr: 2.68e-03, grad_scale: 16.0 2023-06-27 19:57:48,619 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-27 19:58:13,318 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.79 vs. 
limit=6.0 2023-06-27 19:58:20,545 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=1884756.0, ans=0.125 2023-06-27 19:58:31,712 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1884816.0, ans=0.125 2023-06-27 19:58:31,758 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1884816.0, ans=0.1 2023-06-27 19:58:45,956 INFO [train.py:996] (3/4) Epoch 11, batch 9200, loss[loss=0.2297, simple_loss=0.3244, pruned_loss=0.06747, over 21214.00 frames. ], tot_loss[loss=0.2135, simple_loss=0.2988, pruned_loss=0.06412, over 4275373.14 frames. ], batch size: 548, lr: 2.68e-03, grad_scale: 32.0 2023-06-27 20:00:10,155 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=1885116.0, ans=0.125 2023-06-27 20:00:16,344 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.692e+02 7.193e+02 1.189e+03 2.039e+03 4.796e+03, threshold=2.378e+03, percent-clipped=22.0 2023-06-27 20:00:28,228 INFO [train.py:996] (3/4) Epoch 11, batch 9250, loss[loss=0.2366, simple_loss=0.2839, pruned_loss=0.09461, over 21423.00 frames. ], tot_loss[loss=0.2182, simple_loss=0.3026, pruned_loss=0.06689, over 4276098.80 frames. ], batch size: 510, lr: 2.68e-03, grad_scale: 16.0 2023-06-27 20:00:37,264 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=1885176.0, ans=0.125 2023-06-27 20:01:03,915 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1885236.0, ans=0.125 2023-06-27 20:02:03,351 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1885416.0, ans=0.0 2023-06-27 20:02:17,642 INFO [train.py:996] (3/4) Epoch 11, batch 9300, loss[loss=0.2334, simple_loss=0.3372, pruned_loss=0.06484, over 21240.00 frames. ], tot_loss[loss=0.215, simple_loss=0.2964, pruned_loss=0.06681, over 4275449.52 frames. ], batch size: 549, lr: 2.68e-03, grad_scale: 16.0 2023-06-27 20:02:34,929 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=1885536.0, ans=0.04949747468305833 2023-06-27 20:02:54,544 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=1885596.0, ans=0.125 2023-06-27 20:03:11,004 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=11.79 vs. limit=15.0 2023-06-27 20:03:32,228 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=1885656.0, ans=0.0 2023-06-27 20:03:50,076 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.175e+02 5.672e+02 8.335e+02 1.329e+03 3.533e+03, threshold=1.667e+03, percent-clipped=8.0 2023-06-27 20:04:02,348 INFO [train.py:996] (3/4) Epoch 11, batch 9350, loss[loss=0.2224, simple_loss=0.311, pruned_loss=0.06689, over 21874.00 frames. ], tot_loss[loss=0.2186, simple_loss=0.3016, pruned_loss=0.0678, over 4270260.68 frames. 
], batch size: 316, lr: 2.68e-03, grad_scale: 16.0 2023-06-27 20:04:33,322 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1885836.0, ans=0.0 2023-06-27 20:05:39,985 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=1886016.0, ans=0.1 2023-06-27 20:05:45,786 INFO [train.py:996] (3/4) Epoch 11, batch 9400, loss[loss=0.216, simple_loss=0.2823, pruned_loss=0.07482, over 21620.00 frames. ], tot_loss[loss=0.2205, simple_loss=0.3048, pruned_loss=0.06804, over 4272432.08 frames. ], batch size: 298, lr: 2.68e-03, grad_scale: 16.0 2023-06-27 20:06:21,497 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=6.44 vs. limit=12.0 2023-06-27 20:07:01,198 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.54 vs. limit=15.0 2023-06-27 20:07:16,436 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.854e+02 7.483e+02 1.060e+03 1.789e+03 3.889e+03, threshold=2.119e+03, percent-clipped=27.0 2023-06-27 20:07:27,691 INFO [train.py:996] (3/4) Epoch 11, batch 9450, loss[loss=0.1781, simple_loss=0.2478, pruned_loss=0.05423, over 21609.00 frames. ], tot_loss[loss=0.216, simple_loss=0.2976, pruned_loss=0.06716, over 4256908.59 frames. ], batch size: 247, lr: 2.68e-03, grad_scale: 16.0 2023-06-27 20:08:13,767 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=1886436.0, ans=0.125 2023-06-27 20:08:16,925 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1886496.0, ans=0.1 2023-06-27 20:08:43,807 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1886556.0, ans=0.0 2023-06-27 20:08:52,172 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1886556.0, ans=0.0 2023-06-27 20:09:02,097 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1886616.0, ans=0.1 2023-06-27 20:09:11,690 INFO [train.py:996] (3/4) Epoch 11, batch 9500, loss[loss=0.1728, simple_loss=0.2624, pruned_loss=0.04164, over 21748.00 frames. ], tot_loss[loss=0.2105, simple_loss=0.2899, pruned_loss=0.06551, over 4250185.06 frames. 
], batch size: 282, lr: 2.68e-03, grad_scale: 16.0 2023-06-27 20:09:51,244 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1886736.0, ans=0.125 2023-06-27 20:10:03,040 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1886796.0, ans=0.1 2023-06-27 20:10:21,000 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=1886856.0, ans=0.0 2023-06-27 20:10:22,488 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=1886856.0, ans=0.0 2023-06-27 20:10:38,451 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.105e+02 7.389e+02 1.115e+03 1.559e+03 4.093e+03, threshold=2.229e+03, percent-clipped=13.0 2023-06-27 20:10:49,767 INFO [train.py:996] (3/4) Epoch 11, batch 9550, loss[loss=0.2251, simple_loss=0.3179, pruned_loss=0.06618, over 21669.00 frames. ], tot_loss[loss=0.2145, simple_loss=0.293, pruned_loss=0.06798, over 4252236.57 frames. ], batch size: 263, lr: 2.68e-03, grad_scale: 16.0 2023-06-27 20:10:50,998 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=6.27 vs. limit=15.0 2023-06-27 20:10:55,630 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=1886976.0, ans=0.0 2023-06-27 20:12:26,514 INFO [train.py:996] (3/4) Epoch 11, batch 9600, loss[loss=0.2211, simple_loss=0.2972, pruned_loss=0.07252, over 21773.00 frames. ], tot_loss[loss=0.217, simple_loss=0.2958, pruned_loss=0.06914, over 4263328.64 frames. ], batch size: 112, lr: 2.68e-03, grad_scale: 32.0 2023-06-27 20:13:54,720 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.225e+02 7.650e+02 1.090e+03 1.713e+03 4.107e+03, threshold=2.181e+03, percent-clipped=11.0 2023-06-27 20:14:05,174 INFO [train.py:996] (3/4) Epoch 11, batch 9650, loss[loss=0.2252, simple_loss=0.3039, pruned_loss=0.07329, over 21319.00 frames. ], tot_loss[loss=0.2165, simple_loss=0.2953, pruned_loss=0.06884, over 4266126.08 frames. ], batch size: 176, lr: 2.68e-03, grad_scale: 16.0 2023-06-27 20:14:40,081 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1887636.0, ans=0.0 2023-06-27 20:15:00,801 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1887696.0, ans=0.125 2023-06-27 20:15:24,482 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.63 vs. limit=15.0 2023-06-27 20:15:42,611 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=1887816.0, ans=0.0 2023-06-27 20:15:53,493 INFO [train.py:996] (3/4) Epoch 11, batch 9700, loss[loss=0.1988, simple_loss=0.2885, pruned_loss=0.05458, over 21353.00 frames. ], tot_loss[loss=0.2176, simple_loss=0.2974, pruned_loss=0.06885, over 4259055.19 frames. 
], batch size: 159, lr: 2.68e-03, grad_scale: 16.0 2023-06-27 20:16:17,337 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1887876.0, ans=0.125 2023-06-27 20:16:54,984 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1887996.0, ans=0.0 2023-06-27 20:17:15,044 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=1888116.0, ans=0.07 2023-06-27 20:17:20,714 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.740e+02 5.952e+02 8.386e+02 1.192e+03 2.882e+03, threshold=1.677e+03, percent-clipped=3.0 2023-06-27 20:17:21,194 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=1888116.0, ans=0.125 2023-06-27 20:17:27,268 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=3.28 vs. limit=12.0 2023-06-27 20:17:35,536 INFO [train.py:996] (3/4) Epoch 11, batch 9750, loss[loss=0.1965, simple_loss=0.2587, pruned_loss=0.06718, over 21593.00 frames. ], tot_loss[loss=0.2122, simple_loss=0.2899, pruned_loss=0.0673, over 4263993.24 frames. ], batch size: 415, lr: 2.68e-03, grad_scale: 16.0 2023-06-27 20:18:27,923 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.49 vs. limit=22.5 2023-06-27 20:18:59,880 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1888416.0, ans=0.1 2023-06-27 20:19:05,003 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-27 20:19:05,765 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.72 vs. limit=15.0 2023-06-27 20:19:10,797 INFO [train.py:996] (3/4) Epoch 11, batch 9800, loss[loss=0.2007, simple_loss=0.2787, pruned_loss=0.06139, over 21798.00 frames. ], tot_loss[loss=0.2131, simple_loss=0.2911, pruned_loss=0.06755, over 4267816.38 frames. ], batch size: 298, lr: 2.68e-03, grad_scale: 16.0 2023-06-27 20:19:16,456 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1888476.0, ans=0.0 2023-06-27 20:20:18,760 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.77 vs. limit=15.0 2023-06-27 20:20:38,307 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=1888716.0, ans=0.125 2023-06-27 20:20:42,517 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.393e+02 6.342e+02 8.538e+02 1.222e+03 6.218e+03, threshold=1.708e+03, percent-clipped=13.0 2023-06-27 20:20:45,244 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.64 vs. limit=15.0 2023-06-27 20:20:52,443 INFO [train.py:996] (3/4) Epoch 11, batch 9850, loss[loss=0.2126, simple_loss=0.2635, pruned_loss=0.08091, over 21331.00 frames. ], tot_loss[loss=0.2126, simple_loss=0.2887, pruned_loss=0.06827, over 4273418.62 frames. 
], batch size: 473, lr: 2.67e-03, grad_scale: 16.0 2023-06-27 20:21:14,392 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=1888776.0, ans=0.05 2023-06-27 20:21:42,330 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=1888896.0, ans=0.0 2023-06-27 20:22:27,611 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=1889016.0, ans=0.07 2023-06-27 20:22:35,618 INFO [train.py:996] (3/4) Epoch 11, batch 9900, loss[loss=0.1864, simple_loss=0.2587, pruned_loss=0.05702, over 21910.00 frames. ], tot_loss[loss=0.2096, simple_loss=0.2845, pruned_loss=0.06734, over 4279885.48 frames. ], batch size: 107, lr: 2.67e-03, grad_scale: 16.0 2023-06-27 20:23:14,631 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.16 vs. limit=22.5 2023-06-27 20:23:19,351 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1889136.0, ans=0.0 2023-06-27 20:23:24,633 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=1889196.0, ans=0.95 2023-06-27 20:23:55,255 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=1889256.0, ans=0.125 2023-06-27 20:24:07,978 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.793e+02 7.874e+02 1.115e+03 1.655e+03 5.340e+03, threshold=2.230e+03, percent-clipped=22.0 2023-06-27 20:24:12,344 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=1889316.0, ans=0.2 2023-06-27 20:24:18,317 INFO [train.py:996] (3/4) Epoch 11, batch 9950, loss[loss=0.1858, simple_loss=0.2485, pruned_loss=0.0615, over 21566.00 frames. ], tot_loss[loss=0.2123, simple_loss=0.2861, pruned_loss=0.06925, over 4275308.29 frames. ], batch size: 263, lr: 2.67e-03, grad_scale: 16.0 2023-06-27 20:24:55,519 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1889436.0, ans=0.125 2023-06-27 20:25:18,783 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1889496.0, ans=0.125 2023-06-27 20:25:22,763 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.45 vs. limit=10.0 2023-06-27 20:25:25,913 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=1889496.0, ans=0.125 2023-06-27 20:25:29,324 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=1889556.0, ans=0.125 2023-06-27 20:25:30,855 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1889556.0, ans=0.125 2023-06-27 20:25:43,477 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=5.16 vs. limit=15.0 2023-06-27 20:26:16,676 INFO [train.py:996] (3/4) Epoch 11, batch 10000, loss[loss=0.1931, simple_loss=0.2685, pruned_loss=0.05887, over 21531.00 frames. 
], tot_loss[loss=0.2102, simple_loss=0.283, pruned_loss=0.06872, over 4274686.77 frames. ], batch size: 441, lr: 2.67e-03, grad_scale: 32.0 2023-06-27 20:27:01,471 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-27 20:27:03,150 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=1889796.0, ans=0.0 2023-06-27 20:27:03,178 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=1889796.0, ans=0.5 2023-06-27 20:27:46,640 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1889916.0, ans=0.125 2023-06-27 20:27:52,684 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.633e+02 6.432e+02 1.028e+03 1.503e+03 2.874e+03, threshold=2.056e+03, percent-clipped=5.0 2023-06-27 20:28:01,334 INFO [train.py:996] (3/4) Epoch 11, batch 10050, loss[loss=0.2632, simple_loss=0.3747, pruned_loss=0.07584, over 19767.00 frames. ], tot_loss[loss=0.2116, simple_loss=0.2858, pruned_loss=0.06877, over 4270015.06 frames. ], batch size: 703, lr: 2.67e-03, grad_scale: 16.0 2023-06-27 20:28:10,370 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer_ff2.min_abs, batch_count=1889976.0, ans=0.1 2023-06-27 20:28:26,874 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=1890036.0, ans=0.125 2023-06-27 20:28:55,702 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1890156.0, ans=0.125 2023-06-27 20:29:11,057 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1890156.0, ans=0.1 2023-06-27 20:29:44,813 INFO [train.py:996] (3/4) Epoch 11, batch 10100, loss[loss=0.2071, simple_loss=0.2793, pruned_loss=0.0675, over 21611.00 frames. ], tot_loss[loss=0.2098, simple_loss=0.2843, pruned_loss=0.06763, over 4268525.66 frames. ], batch size: 230, lr: 2.67e-03, grad_scale: 16.0 2023-06-27 20:30:13,622 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=1890336.0, ans=0.125 2023-06-27 20:30:29,675 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=8.23 vs. limit=15.0 2023-06-27 20:31:19,822 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.253e+02 6.464e+02 9.043e+02 1.577e+03 3.572e+03, threshold=1.809e+03, percent-clipped=15.0 2023-06-27 20:31:28,335 INFO [train.py:996] (3/4) Epoch 11, batch 10150, loss[loss=0.2048, simple_loss=0.286, pruned_loss=0.06176, over 21671.00 frames. ], tot_loss[loss=0.2143, simple_loss=0.2895, pruned_loss=0.06955, over 4271414.89 frames. 
], batch size: 332, lr: 2.67e-03, grad_scale: 16.0 2023-06-27 20:32:17,806 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten.whitening_limit, batch_count=1890696.0, ans=15.0 2023-06-27 20:32:19,192 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=1890696.0, ans=0.0 2023-06-27 20:32:51,773 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1890816.0, ans=0.125 2023-06-27 20:33:06,432 INFO [train.py:996] (3/4) Epoch 11, batch 10200, loss[loss=0.1916, simple_loss=0.2653, pruned_loss=0.05893, over 21722.00 frames. ], tot_loss[loss=0.2118, simple_loss=0.2888, pruned_loss=0.06737, over 4278576.60 frames. ], batch size: 112, lr: 2.67e-03, grad_scale: 16.0 2023-06-27 20:33:43,598 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=1890996.0, ans=0.035 2023-06-27 20:33:43,697 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=1890996.0, ans=0.0 2023-06-27 20:34:28,231 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1891056.0, ans=0.1 2023-06-27 20:34:41,108 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.060e+02 5.938e+02 9.160e+02 1.393e+03 3.097e+03, threshold=1.832e+03, percent-clipped=16.0 2023-06-27 20:34:45,579 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.28 vs. limit=15.0 2023-06-27 20:34:49,774 INFO [train.py:996] (3/4) Epoch 11, batch 10250, loss[loss=0.1579, simple_loss=0.2304, pruned_loss=0.04268, over 17304.00 frames. ], tot_loss[loss=0.2052, simple_loss=0.2841, pruned_loss=0.06312, over 4267221.48 frames. ], batch size: 62, lr: 2.67e-03, grad_scale: 16.0 2023-06-27 20:35:10,114 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1891236.0, ans=0.125 2023-06-27 20:35:25,673 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=1891236.0, ans=0.125 2023-06-27 20:35:41,998 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=1891296.0, ans=0.125 2023-06-27 20:36:05,694 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.22 vs. limit=15.0 2023-06-27 20:36:29,583 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1891416.0, ans=0.1 2023-06-27 20:36:38,665 INFO [train.py:996] (3/4) Epoch 11, batch 10300, loss[loss=0.2439, simple_loss=0.3468, pruned_loss=0.07053, over 19855.00 frames. ], tot_loss[loss=0.2087, simple_loss=0.2884, pruned_loss=0.06453, over 4262986.25 frames. 
], batch size: 702, lr: 2.67e-03, grad_scale: 16.0 2023-06-27 20:36:47,622 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=1891476.0, ans=0.2 2023-06-27 20:36:50,787 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=1891476.0, ans=0.0 2023-06-27 20:37:06,592 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1891536.0, ans=0.125 2023-06-27 20:37:53,421 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.20 vs. limit=6.0 2023-06-27 20:37:54,950 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.71 vs. limit=15.0 2023-06-27 20:38:04,947 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1891716.0, ans=0.1 2023-06-27 20:38:14,288 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.623e+02 8.106e+02 1.179e+03 1.696e+03 3.317e+03, threshold=2.359e+03, percent-clipped=22.0 2023-06-27 20:38:22,834 INFO [train.py:996] (3/4) Epoch 11, batch 10350, loss[loss=0.225, simple_loss=0.3153, pruned_loss=0.06738, over 21202.00 frames. ], tot_loss[loss=0.21, simple_loss=0.2901, pruned_loss=0.06491, over 4265484.48 frames. ], batch size: 549, lr: 2.67e-03, grad_scale: 16.0 2023-06-27 20:40:03,151 INFO [train.py:996] (3/4) Epoch 11, batch 10400, loss[loss=0.2107, simple_loss=0.2936, pruned_loss=0.06389, over 21735.00 frames. ], tot_loss[loss=0.2049, simple_loss=0.283, pruned_loss=0.06342, over 4262672.87 frames. ], batch size: 415, lr: 2.67e-03, grad_scale: 32.0 2023-06-27 20:40:03,729 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=1892076.0, ans=0.0 2023-06-27 20:40:51,406 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1892196.0, ans=0.125 2023-06-27 20:41:05,388 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-27 20:41:36,682 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.431e+02 7.002e+02 1.054e+03 1.542e+03 5.604e+03, threshold=2.109e+03, percent-clipped=11.0 2023-06-27 20:41:43,660 INFO [train.py:996] (3/4) Epoch 11, batch 10450, loss[loss=0.317, simple_loss=0.3898, pruned_loss=0.1221, over 21389.00 frames. ], tot_loss[loss=0.2124, simple_loss=0.2906, pruned_loss=0.06708, over 4264876.60 frames. ], batch size: 507, lr: 2.67e-03, grad_scale: 16.0 2023-06-27 20:42:39,941 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.81 vs. limit=22.5 2023-06-27 20:42:43,567 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=11.79 vs. 
limit=22.5 2023-06-27 20:42:59,124 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=1892556.0, ans=0.0 2023-06-27 20:43:02,261 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=1892556.0, ans=0.125 2023-06-27 20:43:10,507 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1892616.0, ans=0.125 2023-06-27 20:43:35,357 INFO [train.py:996] (3/4) Epoch 11, batch 10500, loss[loss=0.173, simple_loss=0.2484, pruned_loss=0.04882, over 21545.00 frames. ], tot_loss[loss=0.2101, simple_loss=0.2888, pruned_loss=0.0657, over 4266440.01 frames. ], batch size: 230, lr: 2.67e-03, grad_scale: 8.0 2023-06-27 20:43:38,998 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1892676.0, ans=0.1 2023-06-27 20:43:42,166 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-27 20:44:03,684 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1892736.0, ans=0.125 2023-06-27 20:44:23,181 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=1892796.0, ans=0.125 2023-06-27 20:45:06,707 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.280e+02 6.335e+02 9.197e+02 1.411e+03 2.954e+03, threshold=1.839e+03, percent-clipped=7.0 2023-06-27 20:45:11,719 INFO [train.py:996] (3/4) Epoch 11, batch 10550, loss[loss=0.183, simple_loss=0.2498, pruned_loss=0.05807, over 21576.00 frames. ], tot_loss[loss=0.2057, simple_loss=0.2829, pruned_loss=0.0643, over 4269553.97 frames. ], batch size: 263, lr: 2.67e-03, grad_scale: 8.0 2023-06-27 20:46:01,819 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1893096.0, ans=0.0 2023-06-27 20:46:06,861 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1893096.0, ans=0.0 2023-06-27 20:46:10,352 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1893156.0, ans=0.125 2023-06-27 20:46:13,792 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1893156.0, ans=0.0 2023-06-27 20:47:00,279 INFO [train.py:996] (3/4) Epoch 11, batch 10600, loss[loss=0.1796, simple_loss=0.2596, pruned_loss=0.04981, over 21265.00 frames. ], tot_loss[loss=0.2021, simple_loss=0.2783, pruned_loss=0.063, over 4268335.36 frames. 
], batch size: 159, lr: 2.67e-03, grad_scale: 8.0 2023-06-27 20:47:09,440 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=1893276.0, ans=0.0 2023-06-27 20:47:12,998 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1893276.0, ans=0.0 2023-06-27 20:47:32,357 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=1893336.0, ans=0.2 2023-06-27 20:47:34,026 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=1893336.0, ans=0.125 2023-06-27 20:47:42,833 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=1893396.0, ans=0.04949747468305833 2023-06-27 20:48:44,515 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=1893516.0, ans=0.0 2023-06-27 20:48:45,439 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.137e+02 6.706e+02 1.104e+03 1.395e+03 2.716e+03, threshold=2.208e+03, percent-clipped=10.0 2023-06-27 20:48:50,912 INFO [train.py:996] (3/4) Epoch 11, batch 10650, loss[loss=0.1639, simple_loss=0.2379, pruned_loss=0.04493, over 21358.00 frames. ], tot_loss[loss=0.2016, simple_loss=0.2803, pruned_loss=0.06145, over 4268309.69 frames. ], batch size: 194, lr: 2.67e-03, grad_scale: 8.0 2023-06-27 20:49:10,345 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=11.08 vs. limit=15.0 2023-06-27 20:49:18,525 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=6.03 vs. limit=22.5 2023-06-27 20:49:51,253 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=1893756.0, ans=0.0 2023-06-27 20:49:53,455 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.03 vs. limit=6.0 2023-06-27 20:49:54,773 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=1893756.0, ans=0.0 2023-06-27 20:50:11,462 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=1893756.0, ans=10.0 2023-06-27 20:50:31,109 INFO [train.py:996] (3/4) Epoch 11, batch 10700, loss[loss=0.229, simple_loss=0.3073, pruned_loss=0.07533, over 21222.00 frames. ], tot_loss[loss=0.2024, simple_loss=0.281, pruned_loss=0.06194, over 4264589.17 frames. 
], batch size: 143, lr: 2.67e-03, grad_scale: 8.0 2023-06-27 20:50:50,057 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1893936.0, ans=0.125 2023-06-27 20:51:32,009 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=1894056.0, ans=0.125 2023-06-27 20:51:59,994 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=1894116.0, ans=0.2 2023-06-27 20:52:10,064 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.693e+02 7.296e+02 1.063e+03 2.005e+03 4.294e+03, threshold=2.126e+03, percent-clipped=18.0 2023-06-27 20:52:14,952 INFO [train.py:996] (3/4) Epoch 11, batch 10750, loss[loss=0.227, simple_loss=0.336, pruned_loss=0.05902, over 21303.00 frames. ], tot_loss[loss=0.2097, simple_loss=0.2896, pruned_loss=0.06492, over 4267083.24 frames. ], batch size: 548, lr: 2.67e-03, grad_scale: 8.0 2023-06-27 20:52:40,005 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1894236.0, ans=0.125 2023-06-27 20:53:46,877 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1894416.0, ans=0.1 2023-06-27 20:53:52,915 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.07 vs. limit=15.0 2023-06-27 20:53:55,745 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=1894416.0, ans=0.04949747468305833 2023-06-27 20:54:00,256 INFO [train.py:996] (3/4) Epoch 11, batch 10800, loss[loss=0.263, simple_loss=0.3334, pruned_loss=0.09633, over 21250.00 frames. ], tot_loss[loss=0.2126, simple_loss=0.2939, pruned_loss=0.06558, over 4266748.16 frames. ], batch size: 143, lr: 2.67e-03, grad_scale: 16.0 2023-06-27 20:54:00,774 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=1894476.0, ans=0.125 2023-06-27 20:54:19,499 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=1894476.0, ans=0.035 2023-06-27 20:55:16,212 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.44 vs. limit=15.0 2023-06-27 20:55:27,653 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.54 vs. limit=15.0 2023-06-27 20:55:38,140 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.255e+02 6.646e+02 1.015e+03 1.682e+03 4.029e+03, threshold=2.031e+03, percent-clipped=15.0 2023-06-27 20:55:42,101 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=1894776.0, ans=0.125 2023-06-27 20:55:42,204 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=1894776.0, ans=0.0 2023-06-27 20:55:43,149 INFO [train.py:996] (3/4) Epoch 11, batch 10850, loss[loss=0.187, simple_loss=0.2553, pruned_loss=0.05937, over 21214.00 frames. ], tot_loss[loss=0.2136, simple_loss=0.295, pruned_loss=0.06611, over 4263871.38 frames. 
], batch size: 549, lr: 2.67e-03, grad_scale: 16.0 2023-06-27 20:55:44,079 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=1894776.0, ans=0.2 2023-06-27 20:56:22,090 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=1894836.0, ans=10.0 2023-06-27 20:56:59,150 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=1894956.0, ans=0.125 2023-06-27 20:57:27,846 INFO [train.py:996] (3/4) Epoch 11, batch 10900, loss[loss=0.176, simple_loss=0.2583, pruned_loss=0.04683, over 15783.00 frames. ], tot_loss[loss=0.2086, simple_loss=0.2881, pruned_loss=0.06452, over 4254831.16 frames. ], batch size: 61, lr: 2.67e-03, grad_scale: 16.0 2023-06-27 20:58:19,678 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1895196.0, ans=0.125 2023-06-27 20:58:56,639 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1895316.0, ans=0.125 2023-06-27 20:58:59,316 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.887e+02 5.631e+02 8.269e+02 1.201e+03 2.087e+03, threshold=1.654e+03, percent-clipped=2.0 2023-06-27 20:58:59,742 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=1895316.0, ans=0.125 2023-06-27 20:58:59,812 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1895316.0, ans=0.125 2023-06-27 20:59:04,317 INFO [train.py:996] (3/4) Epoch 11, batch 10950, loss[loss=0.1682, simple_loss=0.2419, pruned_loss=0.04723, over 21647.00 frames. ], tot_loss[loss=0.2038, simple_loss=0.2824, pruned_loss=0.06259, over 4257842.78 frames. ], batch size: 333, lr: 2.67e-03, grad_scale: 16.0 2023-06-27 20:59:14,886 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=1895376.0, ans=0.125 2023-06-27 20:59:57,132 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten.whitening_limit, batch_count=1895496.0, ans=15.0 2023-06-27 21:00:11,135 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=1895496.0, ans=0.0 2023-06-27 21:00:12,866 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1895556.0, ans=0.0 2023-06-27 21:00:36,904 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.37 vs. limit=15.0 2023-06-27 21:00:38,012 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=1895616.0, ans=0.125 2023-06-27 21:00:51,814 INFO [train.py:996] (3/4) Epoch 11, batch 11000, loss[loss=0.2168, simple_loss=0.2916, pruned_loss=0.07098, over 21524.00 frames. ], tot_loss[loss=0.2045, simple_loss=0.2823, pruned_loss=0.06331, over 4263226.26 frames. 
], batch size: 131, lr: 2.67e-03, grad_scale: 16.0 2023-06-27 21:01:13,794 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=1895736.0, ans=0.2 2023-06-27 21:01:50,730 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=1895796.0, ans=0.125 2023-06-27 21:02:15,500 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=1895916.0, ans=0.035 2023-06-27 21:02:23,656 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=1895916.0, ans=0.0 2023-06-27 21:02:24,709 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.482e+02 6.416e+02 9.382e+02 1.390e+03 3.598e+03, threshold=1.876e+03, percent-clipped=17.0 2023-06-27 21:02:28,491 INFO [train.py:996] (3/4) Epoch 11, batch 11050, loss[loss=0.2037, simple_loss=0.2616, pruned_loss=0.07286, over 21373.00 frames. ], tot_loss[loss=0.2057, simple_loss=0.2811, pruned_loss=0.06519, over 4259701.90 frames. ], batch size: 144, lr: 2.67e-03, grad_scale: 8.0 2023-06-27 21:04:05,437 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1896216.0, ans=0.1 2023-06-27 21:04:16,371 INFO [train.py:996] (3/4) Epoch 11, batch 11100, loss[loss=0.2007, simple_loss=0.2697, pruned_loss=0.06591, over 14919.00 frames. ], tot_loss[loss=0.2053, simple_loss=0.2796, pruned_loss=0.0655, over 4255595.69 frames. ], batch size: 61, lr: 2.67e-03, grad_scale: 8.0 2023-06-27 21:04:20,200 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1896276.0, ans=0.125 2023-06-27 21:04:54,721 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1896336.0, ans=0.1 2023-06-27 21:05:09,768 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1896396.0, ans=0.0 2023-06-27 21:05:14,555 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1896396.0, ans=0.125 2023-06-27 21:05:55,701 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.559e+02 5.976e+02 8.582e+02 1.481e+03 2.937e+03, threshold=1.716e+03, percent-clipped=16.0 2023-06-27 21:05:58,936 INFO [train.py:996] (3/4) Epoch 11, batch 11150, loss[loss=0.2242, simple_loss=0.3215, pruned_loss=0.0635, over 21746.00 frames. ], tot_loss[loss=0.2048, simple_loss=0.2786, pruned_loss=0.06548, over 4249927.38 frames. ], batch size: 351, lr: 2.67e-03, grad_scale: 8.0 2023-06-27 21:06:03,315 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=2.74 vs. limit=6.0 2023-06-27 21:07:05,039 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=1896756.0, ans=0.2 2023-06-27 21:07:40,341 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten.whitening_limit, batch_count=1896816.0, ans=22.5 2023-06-27 21:07:42,371 INFO [train.py:996] (3/4) Epoch 11, batch 11200, loss[loss=0.1871, simple_loss=0.2485, pruned_loss=0.06288, over 21752.00 frames. ], tot_loss[loss=0.2033, simple_loss=0.2771, pruned_loss=0.06474, over 4246413.54 frames. 
], batch size: 112, lr: 2.67e-03, grad_scale: 16.0 2023-06-27 21:08:08,021 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1896936.0, ans=0.1 2023-06-27 21:08:26,020 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=1896936.0, ans=0.95 2023-06-27 21:08:26,052 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1896936.0, ans=0.125 2023-06-27 21:09:03,700 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=1897116.0, ans=0.2 2023-06-27 21:09:20,938 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.297e+02 6.349e+02 8.470e+02 1.226e+03 2.540e+03, threshold=1.694e+03, percent-clipped=7.0 2023-06-27 21:09:24,605 INFO [train.py:996] (3/4) Epoch 11, batch 11250, loss[loss=0.1934, simple_loss=0.272, pruned_loss=0.05736, over 21563.00 frames. ], tot_loss[loss=0.2038, simple_loss=0.278, pruned_loss=0.06482, over 4257404.80 frames. ], batch size: 263, lr: 2.67e-03, grad_scale: 16.0 2023-06-27 21:09:43,218 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=1897176.0, ans=0.2 2023-06-27 21:09:46,558 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-27 21:10:04,018 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1897236.0, ans=0.0 2023-06-27 21:10:12,015 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=1897296.0, ans=0.0 2023-06-27 21:10:15,156 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=1897296.0, ans=0.125 2023-06-27 21:10:22,311 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=1897296.0, ans=0.125 2023-06-27 21:10:44,147 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1897416.0, ans=0.125 2023-06-27 21:11:06,531 INFO [train.py:996] (3/4) Epoch 11, batch 11300, loss[loss=0.1927, simple_loss=0.2754, pruned_loss=0.05505, over 21750.00 frames. ], tot_loss[loss=0.2036, simple_loss=0.2783, pruned_loss=0.06444, over 4269269.55 frames. ], batch size: 112, lr: 2.67e-03, grad_scale: 16.0 2023-06-27 21:11:51,080 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1897536.0, ans=0.125 2023-06-27 21:12:07,586 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=1897596.0, ans=0.2 2023-06-27 21:12:44,923 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.908e+02 6.761e+02 9.436e+02 1.469e+03 2.612e+03, threshold=1.887e+03, percent-clipped=16.0 2023-06-27 21:12:48,347 INFO [train.py:996] (3/4) Epoch 11, batch 11350, loss[loss=0.2453, simple_loss=0.3229, pruned_loss=0.08387, over 21380.00 frames. ], tot_loss[loss=0.2034, simple_loss=0.279, pruned_loss=0.06388, over 4270999.94 frames. 
], batch size: 548, lr: 2.67e-03, grad_scale: 16.0 2023-06-27 21:12:54,381 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=1897776.0, ans=0.2 2023-06-27 21:13:29,812 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1897836.0, ans=0.1 2023-06-27 21:13:49,986 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=1897896.0, ans=0.2 2023-06-27 21:14:02,633 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.58 vs. limit=22.5 2023-06-27 21:14:18,369 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=1898016.0, ans=0.125 2023-06-27 21:14:26,720 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1898016.0, ans=0.1 2023-06-27 21:14:30,957 INFO [train.py:996] (3/4) Epoch 11, batch 11400, loss[loss=0.2102, simple_loss=0.282, pruned_loss=0.06922, over 20020.00 frames. ], tot_loss[loss=0.2074, simple_loss=0.2838, pruned_loss=0.06553, over 4277936.41 frames. ], batch size: 702, lr: 2.67e-03, grad_scale: 16.0 2023-06-27 21:15:13,809 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=1898136.0, ans=0.0 2023-06-27 21:15:25,429 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1898196.0, ans=0.125 2023-06-27 21:15:43,222 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten.whitening_limit, batch_count=1898256.0, ans=15.0 2023-06-27 21:15:47,603 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1898256.0, ans=0.1 2023-06-27 21:16:04,705 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.74 vs. limit=15.0 2023-06-27 21:16:09,914 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.866e+02 7.031e+02 1.006e+03 1.495e+03 2.656e+03, threshold=2.011e+03, percent-clipped=10.0 2023-06-27 21:16:23,426 INFO [train.py:996] (3/4) Epoch 11, batch 11450, loss[loss=0.1858, simple_loss=0.2509, pruned_loss=0.06038, over 21214.00 frames. ], tot_loss[loss=0.2057, simple_loss=0.2836, pruned_loss=0.06389, over 4270609.73 frames. ], batch size: 608, lr: 2.67e-03, grad_scale: 16.0 2023-06-27 21:17:13,102 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=1898496.0, ans=0.0 2023-06-27 21:17:17,984 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1898556.0, ans=0.125 2023-06-27 21:17:22,426 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.80 vs. limit=6.0 2023-06-27 21:17:23,403 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1898556.0, ans=0.1 2023-06-27 21:17:35,190 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.11 vs. 
limit=15.0 2023-06-27 21:17:51,240 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1898616.0, ans=0.0 2023-06-27 21:18:06,601 INFO [train.py:996] (3/4) Epoch 11, batch 11500, loss[loss=0.1699, simple_loss=0.2223, pruned_loss=0.05879, over 20885.00 frames. ], tot_loss[loss=0.2083, simple_loss=0.2868, pruned_loss=0.06496, over 4271200.74 frames. ], batch size: 608, lr: 2.67e-03, grad_scale: 16.0 2023-06-27 21:18:13,621 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1898676.0, ans=0.0 2023-06-27 21:18:22,345 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1898676.0, ans=0.1 2023-06-27 21:18:35,986 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1898736.0, ans=0.125 2023-06-27 21:18:51,117 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1898796.0, ans=0.125 2023-06-27 21:19:15,080 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1898856.0, ans=0.125 2023-06-27 21:19:16,701 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1898856.0, ans=0.0 2023-06-27 21:19:23,881 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=1898916.0, ans=0.125 2023-06-27 21:19:33,928 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1898916.0, ans=0.1 2023-06-27 21:19:48,636 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.762e+02 6.937e+02 1.166e+03 1.634e+03 3.269e+03, threshold=2.333e+03, percent-clipped=13.0 2023-06-27 21:19:52,373 INFO [train.py:996] (3/4) Epoch 11, batch 11550, loss[loss=0.2895, simple_loss=0.4127, pruned_loss=0.08316, over 21181.00 frames. ], tot_loss[loss=0.2129, simple_loss=0.2943, pruned_loss=0.06574, over 4271833.81 frames. ], batch size: 548, lr: 2.67e-03, grad_scale: 16.0 2023-06-27 21:19:54,863 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1898976.0, ans=0.0 2023-06-27 21:21:23,591 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1899216.0, ans=0.1 2023-06-27 21:21:35,349 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1899216.0, ans=0.0 2023-06-27 21:21:38,051 INFO [train.py:996] (3/4) Epoch 11, batch 11600, loss[loss=0.2412, simple_loss=0.3388, pruned_loss=0.07177, over 21571.00 frames. ], tot_loss[loss=0.2206, simple_loss=0.3068, pruned_loss=0.06716, over 4266394.71 frames. ], batch size: 230, lr: 2.67e-03, grad_scale: 32.0 2023-06-27 21:21:46,765 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1899276.0, ans=0.0 2023-06-27 21:22:42,458 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=1899456.0, ans=0.07 2023-06-27 21:22:46,384 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=9.27 vs. 
limit=15.0 2023-06-27 21:22:47,508 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1899456.0, ans=0.1 2023-06-27 21:23:15,077 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.239e+02 7.630e+02 1.389e+03 2.274e+03 4.713e+03, threshold=2.778e+03, percent-clipped=21.0 2023-06-27 21:23:16,793 INFO [train.py:996] (3/4) Epoch 11, batch 11650, loss[loss=0.2182, simple_loss=0.3011, pruned_loss=0.06765, over 21846.00 frames. ], tot_loss[loss=0.2256, simple_loss=0.3139, pruned_loss=0.06866, over 4262803.99 frames. ], batch size: 372, lr: 2.67e-03, grad_scale: 16.0 2023-06-27 21:23:34,259 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1899636.0, ans=0.125 2023-06-27 21:23:38,877 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-27 21:24:37,995 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=1899816.0, ans=0.05 2023-06-27 21:24:48,053 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.56 vs. limit=15.0 2023-06-27 21:24:53,652 INFO [train.py:996] (3/4) Epoch 11, batch 11700, loss[loss=0.1833, simple_loss=0.2472, pruned_loss=0.0597, over 21588.00 frames. ], tot_loss[loss=0.221, simple_loss=0.3051, pruned_loss=0.06847, over 4255126.50 frames. ], batch size: 213, lr: 2.67e-03, grad_scale: 16.0 2023-06-27 21:25:04,142 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1899876.0, ans=0.125 2023-06-27 21:25:13,792 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=1899936.0, ans=0.04949747468305833 2023-06-27 21:25:39,668 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.74 vs. limit=15.0 2023-06-27 21:25:42,058 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=1899996.0, ans=0.2 2023-06-27 21:26:01,662 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=1900056.0, ans=0.2 2023-06-27 21:26:04,797 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1900056.0, ans=0.125 2023-06-27 21:26:28,328 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.521e+02 7.314e+02 1.090e+03 1.615e+03 2.478e+03, threshold=2.180e+03, percent-clipped=0.0 2023-06-27 21:26:29,976 INFO [train.py:996] (3/4) Epoch 11, batch 11750, loss[loss=0.2086, simple_loss=0.2815, pruned_loss=0.06785, over 21745.00 frames. ], tot_loss[loss=0.2161, simple_loss=0.2964, pruned_loss=0.06789, over 4254156.08 frames. 
], batch size: 282, lr: 2.67e-03, grad_scale: 16.0 2023-06-27 21:26:45,272 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=1900236.0, ans=0.125 2023-06-27 21:26:51,048 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1900236.0, ans=0.125 2023-06-27 21:27:31,215 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-27 21:27:59,640 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=1900416.0, ans=0.0 2023-06-27 21:28:08,580 INFO [train.py:996] (3/4) Epoch 11, batch 11800, loss[loss=0.2126, simple_loss=0.3177, pruned_loss=0.05375, over 21891.00 frames. ], tot_loss[loss=0.2185, simple_loss=0.2976, pruned_loss=0.06968, over 4261928.15 frames. ], batch size: 372, lr: 2.67e-03, grad_scale: 16.0 2023-06-27 21:28:09,210 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=1900476.0, ans=0.0 2023-06-27 21:28:17,749 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=1900476.0, ans=0.0 2023-06-27 21:28:56,903 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1900596.0, ans=0.0 2023-06-27 21:29:30,214 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1900716.0, ans=0.125 2023-06-27 21:29:43,647 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1900716.0, ans=0.125 2023-06-27 21:29:44,971 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.871e+02 7.084e+02 9.791e+02 1.465e+03 2.454e+03, threshold=1.958e+03, percent-clipped=4.0 2023-06-27 21:29:46,621 INFO [train.py:996] (3/4) Epoch 11, batch 11850, loss[loss=0.2016, simple_loss=0.292, pruned_loss=0.05564, over 21763.00 frames. ], tot_loss[loss=0.2177, simple_loss=0.2985, pruned_loss=0.06848, over 4271468.72 frames. ], batch size: 247, lr: 2.67e-03, grad_scale: 16.0 2023-06-27 21:30:29,131 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1900896.0, ans=0.1 2023-06-27 21:31:25,868 INFO [train.py:996] (3/4) Epoch 11, batch 11900, loss[loss=0.208, simple_loss=0.3057, pruned_loss=0.0551, over 21216.00 frames. ], tot_loss[loss=0.2151, simple_loss=0.2982, pruned_loss=0.06598, over 4265100.07 frames. ], batch size: 548, lr: 2.67e-03, grad_scale: 16.0 2023-06-27 21:31:47,424 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=1901136.0, ans=0.125 2023-06-27 21:31:48,945 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1901136.0, ans=0.0 2023-06-27 21:33:08,449 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.313e+02 5.634e+02 7.548e+02 1.171e+03 3.128e+03, threshold=1.510e+03, percent-clipped=7.0 2023-06-27 21:33:14,926 INFO [train.py:996] (3/4) Epoch 11, batch 11950, loss[loss=0.1959, simple_loss=0.2981, pruned_loss=0.04687, over 21723.00 frames. ], tot_loss[loss=0.2131, simple_loss=0.2993, pruned_loss=0.06343, over 4265602.54 frames. 
], batch size: 351, lr: 2.67e-03, grad_scale: 16.0 2023-06-27 21:33:30,239 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-27 21:33:56,702 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=1901496.0, ans=0.125 2023-06-27 21:34:06,257 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1901496.0, ans=0.125 2023-06-27 21:34:32,190 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=12.43 vs. limit=22.5 2023-06-27 21:34:52,070 INFO [train.py:996] (3/4) Epoch 11, batch 12000, loss[loss=0.1863, simple_loss=0.2572, pruned_loss=0.0577, over 21508.00 frames. ], tot_loss[loss=0.2081, simple_loss=0.293, pruned_loss=0.0616, over 4261987.93 frames. ], batch size: 212, lr: 2.67e-03, grad_scale: 32.0 2023-06-27 21:34:52,070 INFO [train.py:1019] (3/4) Computing validation loss 2023-06-27 21:35:12,137 INFO [train.py:1028] (3/4) Epoch 11, validation: loss=0.2616, simple_loss=0.3513, pruned_loss=0.08594, over 1796401.00 frames. 2023-06-27 21:35:12,138 INFO [train.py:1029] (3/4) Maximum memory allocated so far is 23690MB 2023-06-27 21:35:31,681 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=1901676.0, ans=0.07 2023-06-27 21:35:51,307 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1901736.0, ans=0.125 2023-06-27 21:36:54,304 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1901976.0, ans=0.0 2023-06-27 21:36:59,869 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.508e+02 6.629e+02 1.038e+03 1.682e+03 4.454e+03, threshold=2.077e+03, percent-clipped=31.0 2023-06-27 21:36:59,900 INFO [train.py:996] (3/4) Epoch 11, batch 12050, loss[loss=0.2127, simple_loss=0.2806, pruned_loss=0.07235, over 21914.00 frames. ], tot_loss[loss=0.2086, simple_loss=0.2898, pruned_loss=0.06366, over 4271421.74 frames. ], batch size: 316, lr: 2.67e-03, grad_scale: 16.0 2023-06-27 21:37:53,839 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1902156.0, ans=0.0 2023-06-27 21:38:39,132 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=1902216.0, ans=0.125 2023-06-27 21:38:43,463 INFO [train.py:996] (3/4) Epoch 11, batch 12100, loss[loss=0.3104, simple_loss=0.3642, pruned_loss=0.1284, over 21366.00 frames. ], tot_loss[loss=0.2142, simple_loss=0.2937, pruned_loss=0.06734, over 4271083.79 frames. 
], batch size: 507, lr: 2.67e-03, grad_scale: 16.0 2023-06-27 21:39:08,635 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1902336.0, ans=0.0 2023-06-27 21:39:29,385 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1902396.0, ans=0.0 2023-06-27 21:39:34,894 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1902456.0, ans=0.125 2023-06-27 21:40:24,611 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=1902516.0, ans=0.0 2023-06-27 21:40:29,118 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.049e+02 8.933e+02 1.358e+03 2.118e+03 4.417e+03, threshold=2.716e+03, percent-clipped=26.0 2023-06-27 21:40:29,161 INFO [train.py:996] (3/4) Epoch 11, batch 12150, loss[loss=0.191, simple_loss=0.2995, pruned_loss=0.0413, over 20831.00 frames. ], tot_loss[loss=0.2151, simple_loss=0.2973, pruned_loss=0.06642, over 4269613.95 frames. ], batch size: 607, lr: 2.67e-03, grad_scale: 16.0 2023-06-27 21:40:29,761 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1902576.0, ans=0.125 2023-06-27 21:40:41,405 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1902576.0, ans=0.1 2023-06-27 21:41:11,252 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1902696.0, ans=0.1 2023-06-27 21:41:33,431 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1902756.0, ans=0.125 2023-06-27 21:42:09,509 INFO [train.py:996] (3/4) Epoch 11, batch 12200, loss[loss=0.1957, simple_loss=0.2597, pruned_loss=0.06589, over 21566.00 frames. ], tot_loss[loss=0.2126, simple_loss=0.294, pruned_loss=0.06561, over 4262205.53 frames. ], batch size: 231, lr: 2.66e-03, grad_scale: 16.0 2023-06-27 21:42:25,603 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=1902936.0, ans=10.0 2023-06-27 21:43:20,510 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1903056.0, ans=0.125 2023-06-27 21:43:50,745 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.836e+02 6.388e+02 1.060e+03 1.811e+03 4.082e+03, threshold=2.119e+03, percent-clipped=7.0 2023-06-27 21:43:50,775 INFO [train.py:996] (3/4) Epoch 11, batch 12250, loss[loss=0.1677, simple_loss=0.2521, pruned_loss=0.04161, over 21654.00 frames. ], tot_loss[loss=0.206, simple_loss=0.2862, pruned_loss=0.06288, over 4263755.65 frames. ], batch size: 247, lr: 2.66e-03, grad_scale: 16.0 2023-06-27 21:44:01,966 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=11.50 vs. 
limit=15.0 2023-06-27 21:44:06,662 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=1903236.0, ans=0.125 2023-06-27 21:44:23,883 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1903296.0, ans=0.125 2023-06-27 21:44:28,940 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.06 vs. limit=22.5 2023-06-27 21:45:30,433 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=2.61 vs. limit=15.0 2023-06-27 21:45:34,156 INFO [train.py:996] (3/4) Epoch 11, batch 12300, loss[loss=0.2253, simple_loss=0.3184, pruned_loss=0.06611, over 21751.00 frames. ], tot_loss[loss=0.1982, simple_loss=0.2796, pruned_loss=0.0584, over 4267655.70 frames. ], batch size: 351, lr: 2.66e-03, grad_scale: 16.0 2023-06-27 21:45:50,092 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1903536.0, ans=0.0 2023-06-27 21:46:08,621 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=1903596.0, ans=0.0 2023-06-27 21:46:35,679 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=9.74 vs. limit=15.0 2023-06-27 21:46:49,621 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1903656.0, ans=0.0 2023-06-27 21:47:16,522 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.514e+02 6.456e+02 1.093e+03 1.764e+03 5.046e+03, threshold=2.186e+03, percent-clipped=16.0 2023-06-27 21:47:16,568 INFO [train.py:996] (3/4) Epoch 11, batch 12350, loss[loss=0.208, simple_loss=0.2812, pruned_loss=0.06743, over 21602.00 frames. ], tot_loss[loss=0.2021, simple_loss=0.2853, pruned_loss=0.05947, over 4272614.50 frames. ], batch size: 195, lr: 2.66e-03, grad_scale: 16.0 2023-06-27 21:47:20,344 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=1903776.0, ans=0.0 2023-06-27 21:47:28,571 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=1903776.0, ans=0.125 2023-06-27 21:47:30,291 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=1903776.0, ans=0.0 2023-06-27 21:48:13,837 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=1903956.0, ans=0.2 2023-06-27 21:48:21,740 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1903956.0, ans=0.1 2023-06-27 21:48:57,239 INFO [train.py:996] (3/4) Epoch 11, batch 12400, loss[loss=0.2345, simple_loss=0.3017, pruned_loss=0.08362, over 21731.00 frames. ], tot_loss[loss=0.2062, simple_loss=0.287, pruned_loss=0.06268, over 4280563.68 frames. ], batch size: 389, lr: 2.66e-03, grad_scale: 32.0 2023-06-27 21:48:58,240 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.57 vs. 
limit=15.0 2023-06-27 21:49:05,149 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten.whitening_limit, batch_count=1904076.0, ans=15.0 2023-06-27 21:49:27,703 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1904136.0, ans=0.125 2023-06-27 21:49:49,078 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=1904196.0, ans=0.2 2023-06-27 21:49:54,725 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten.whitening_limit, batch_count=1904196.0, ans=22.5 2023-06-27 21:50:09,865 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=1904256.0, ans=0.125 2023-06-27 21:50:39,854 INFO [train.py:996] (3/4) Epoch 11, batch 12450, loss[loss=0.2595, simple_loss=0.3362, pruned_loss=0.09139, over 21343.00 frames. ], tot_loss[loss=0.2112, simple_loss=0.2909, pruned_loss=0.06577, over 4278821.27 frames. ], batch size: 143, lr: 2.66e-03, grad_scale: 16.0 2023-06-27 21:50:41,642 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.549e+02 6.531e+02 8.502e+02 1.313e+03 3.916e+03, threshold=1.700e+03, percent-clipped=4.0 2023-06-27 21:51:56,605 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=1904556.0, ans=0.2 2023-06-27 21:52:08,657 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=1904616.0, ans=0.2 2023-06-27 21:52:13,492 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1904616.0, ans=0.0 2023-06-27 21:52:29,344 INFO [train.py:996] (3/4) Epoch 11, batch 12500, loss[loss=0.2406, simple_loss=0.3376, pruned_loss=0.07175, over 21942.00 frames. ], tot_loss[loss=0.2208, simple_loss=0.3031, pruned_loss=0.0692, over 4277203.94 frames. ], batch size: 317, lr: 2.66e-03, grad_scale: 16.0 2023-06-27 21:52:30,295 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1904676.0, ans=0.0 2023-06-27 21:53:15,763 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1904796.0, ans=0.1 2023-06-27 21:53:19,863 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.07 vs. limit=12.0 2023-06-27 21:53:21,133 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=1904796.0, ans=0.2 2023-06-27 21:53:35,038 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=1904856.0, ans=0.0 2023-06-27 21:54:07,561 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1904916.0, ans=0.125 2023-06-27 21:54:10,059 INFO [train.py:996] (3/4) Epoch 11, batch 12550, loss[loss=0.2237, simple_loss=0.3031, pruned_loss=0.07216, over 21336.00 frames. ], tot_loss[loss=0.2243, simple_loss=0.3067, pruned_loss=0.07093, over 4281330.67 frames. 
], batch size: 159, lr: 2.66e-03, grad_scale: 16.0 2023-06-27 21:54:11,832 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.997e+02 7.151e+02 9.738e+02 1.410e+03 2.995e+03, threshold=1.948e+03, percent-clipped=12.0 2023-06-27 21:54:14,206 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1904976.0, ans=0.125 2023-06-27 21:54:41,894 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.58 vs. limit=15.0 2023-06-27 21:54:51,150 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1905036.0, ans=0.125 2023-06-27 21:55:13,190 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-27 21:55:23,299 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.42 vs. limit=22.5 2023-06-27 21:55:34,394 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1905216.0, ans=0.125 2023-06-27 21:55:53,719 INFO [train.py:996] (3/4) Epoch 11, batch 12600, loss[loss=0.1826, simple_loss=0.2737, pruned_loss=0.0458, over 21507.00 frames. ], tot_loss[loss=0.2219, simple_loss=0.3059, pruned_loss=0.06892, over 4274040.61 frames. ], batch size: 195, lr: 2.66e-03, grad_scale: 16.0 2023-06-27 21:56:33,601 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1905396.0, ans=0.0 2023-06-27 21:57:30,665 INFO [train.py:996] (3/4) Epoch 11, batch 12650, loss[loss=0.2181, simple_loss=0.2852, pruned_loss=0.07547, over 21860.00 frames. ], tot_loss[loss=0.215, simple_loss=0.2988, pruned_loss=0.06557, over 4274186.32 frames. ], batch size: 351, lr: 2.66e-03, grad_scale: 16.0 2023-06-27 21:57:36,981 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.857e+02 6.084e+02 8.925e+02 1.601e+03 4.127e+03, threshold=1.785e+03, percent-clipped=11.0 2023-06-27 21:57:39,748 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.31 vs. limit=15.0 2023-06-27 21:57:40,855 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1905576.0, ans=0.125 2023-06-27 21:57:51,144 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=8.96 vs. limit=22.5 2023-06-27 21:57:59,058 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=2.81 vs. 
limit=12.0 2023-06-27 21:58:01,757 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1905636.0, ans=0.1 2023-06-27 21:58:03,408 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1905636.0, ans=0.1 2023-06-27 21:58:15,145 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=1905696.0, ans=0.0 2023-06-27 21:58:41,588 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1905756.0, ans=0.0 2023-06-27 21:58:53,428 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1905816.0, ans=0.125 2023-06-27 21:58:56,756 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=1905816.0, ans=0.0 2023-06-27 21:59:17,357 INFO [train.py:996] (3/4) Epoch 11, batch 12700, loss[loss=0.2257, simple_loss=0.307, pruned_loss=0.07216, over 21467.00 frames. ], tot_loss[loss=0.2153, simple_loss=0.2965, pruned_loss=0.06707, over 4279940.30 frames. ], batch size: 131, lr: 2.66e-03, grad_scale: 8.0 2023-06-27 22:00:51,194 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.42 vs. limit=12.0 2023-06-27 22:00:59,914 INFO [train.py:996] (3/4) Epoch 11, batch 12750, loss[loss=0.2363, simple_loss=0.3085, pruned_loss=0.08205, over 21890.00 frames. ], tot_loss[loss=0.216, simple_loss=0.2973, pruned_loss=0.0673, over 4278809.21 frames. ], batch size: 107, lr: 2.66e-03, grad_scale: 8.0 2023-06-27 22:01:03,057 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.884e+02 6.384e+02 9.703e+02 1.626e+03 3.460e+03, threshold=1.941e+03, percent-clipped=17.0 2023-06-27 22:01:26,721 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1906236.0, ans=0.1 2023-06-27 22:02:09,624 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1906416.0, ans=0.125 2023-06-27 22:02:13,001 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1906416.0, ans=0.2 2023-06-27 22:02:26,548 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.04 vs. limit=6.0 2023-06-27 22:02:35,397 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.26 vs. limit=15.0 2023-06-27 22:02:37,443 INFO [train.py:996] (3/4) Epoch 11, batch 12800, loss[loss=0.2297, simple_loss=0.3059, pruned_loss=0.0768, over 21676.00 frames. ], tot_loss[loss=0.2171, simple_loss=0.2976, pruned_loss=0.06828, over 4288948.59 frames. ], batch size: 389, lr: 2.66e-03, grad_scale: 16.0 2023-06-27 22:02:55,488 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1906536.0, ans=0.125 2023-06-27 22:03:08,191 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.40 vs. 
limit=6.0 2023-06-27 22:03:15,601 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=1906596.0, ans=0.125 2023-06-27 22:03:30,009 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.34 vs. limit=12.0 2023-06-27 22:04:16,512 INFO [train.py:996] (3/4) Epoch 11, batch 12850, loss[loss=0.1917, simple_loss=0.2893, pruned_loss=0.04699, over 21853.00 frames. ], tot_loss[loss=0.2186, simple_loss=0.299, pruned_loss=0.06911, over 4289206.74 frames. ], batch size: 316, lr: 2.66e-03, grad_scale: 16.0 2023-06-27 22:04:19,912 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.496e+02 6.619e+02 8.381e+02 1.196e+03 2.769e+03, threshold=1.676e+03, percent-clipped=10.0 2023-06-27 22:04:48,284 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=7.15 vs. limit=15.0 2023-06-27 22:04:59,690 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=1906896.0, ans=0.125 2023-06-27 22:04:59,726 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=1906896.0, ans=0.0 2023-06-27 22:05:13,394 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.76 vs. limit=15.0 2023-06-27 22:05:56,183 INFO [train.py:996] (3/4) Epoch 11, batch 12900, loss[loss=0.2305, simple_loss=0.3259, pruned_loss=0.06754, over 21140.00 frames. ], tot_loss[loss=0.215, simple_loss=0.2974, pruned_loss=0.06626, over 4284004.62 frames. ], batch size: 548, lr: 2.66e-03, grad_scale: 16.0 2023-06-27 22:06:03,074 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1907076.0, ans=0.0 2023-06-27 22:06:18,379 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.90 vs. limit=15.0 2023-06-27 22:07:17,550 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1907256.0, ans=0.125 2023-06-27 22:07:29,015 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=1907316.0, ans=0.025 2023-06-27 22:07:38,235 INFO [train.py:996] (3/4) Epoch 11, batch 12950, loss[loss=0.1752, simple_loss=0.2582, pruned_loss=0.04606, over 21227.00 frames. ], tot_loss[loss=0.2107, simple_loss=0.2933, pruned_loss=0.06398, over 4280052.96 frames. 
], batch size: 176, lr: 2.66e-03, grad_scale: 16.0 2023-06-27 22:07:46,034 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.286e+02 5.649e+02 7.456e+02 9.840e+02 3.735e+03, threshold=1.491e+03, percent-clipped=7.0 2023-06-27 22:08:35,134 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1907496.0, ans=0.0 2023-06-27 22:08:45,075 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=1907496.0, ans=0.0 2023-06-27 22:08:58,190 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=1907556.0, ans=0.2 2023-06-27 22:09:15,931 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.14 vs. limit=15.0 2023-06-27 22:09:19,446 INFO [train.py:996] (3/4) Epoch 11, batch 13000, loss[loss=0.1634, simple_loss=0.2468, pruned_loss=0.04005, over 21198.00 frames. ], tot_loss[loss=0.212, simple_loss=0.294, pruned_loss=0.06501, over 4279042.54 frames. ], batch size: 176, lr: 2.66e-03, grad_scale: 16.0 2023-06-27 22:10:18,260 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1907796.0, ans=0.125 2023-06-27 22:10:20,189 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=1907796.0, ans=0.125 2023-06-27 22:10:25,002 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=1907796.0, ans=0.0 2023-06-27 22:10:52,345 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1907916.0, ans=0.125 2023-06-27 22:11:06,074 INFO [train.py:996] (3/4) Epoch 11, batch 13050, loss[loss=0.1984, simple_loss=0.2666, pruned_loss=0.06509, over 21855.00 frames. ], tot_loss[loss=0.2087, simple_loss=0.2903, pruned_loss=0.06353, over 4275276.19 frames. ], batch size: 247, lr: 2.66e-03, grad_scale: 16.0 2023-06-27 22:11:09,261 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.155e+02 7.836e+02 1.180e+03 1.629e+03 3.232e+03, threshold=2.361e+03, percent-clipped=34.0 2023-06-27 22:11:29,432 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=1908036.0, ans=0.015 2023-06-27 22:12:00,023 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.15 vs. limit=15.0 2023-06-27 22:12:19,881 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.50 vs. limit=15.0 2023-06-27 22:12:22,848 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=1908216.0, ans=0.125 2023-06-27 22:12:43,793 INFO [train.py:996] (3/4) Epoch 11, batch 13100, loss[loss=0.2586, simple_loss=0.3754, pruned_loss=0.07091, over 19773.00 frames. ], tot_loss[loss=0.2099, simple_loss=0.293, pruned_loss=0.0634, over 4274227.68 frames. ], batch size: 703, lr: 2.66e-03, grad_scale: 16.0 2023-06-27 22:12:55,263 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.78 vs. 
limit=10.0 2023-06-27 22:13:09,513 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1908336.0, ans=0.0 2023-06-27 22:13:11,481 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1908336.0, ans=0.125 2023-06-27 22:14:01,233 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=1908516.0, ans=0.035 2023-06-27 22:14:16,972 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten.whitening_limit, batch_count=1908516.0, ans=15.0 2023-06-27 22:14:18,571 INFO [scaling.py:962] (3/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=5.99 vs. limit=8.0 2023-06-27 22:14:19,041 INFO [train.py:996] (3/4) Epoch 11, batch 13150, loss[loss=0.2512, simple_loss=0.4011, pruned_loss=0.05061, over 19709.00 frames. ], tot_loss[loss=0.2151, simple_loss=0.2976, pruned_loss=0.06631, over 4266055.28 frames. ], batch size: 702, lr: 2.66e-03, grad_scale: 16.0 2023-06-27 22:14:19,658 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1908576.0, ans=0.125 2023-06-27 22:14:22,233 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.348e+02 6.692e+02 9.537e+02 1.354e+03 2.505e+03, threshold=1.907e+03, percent-clipped=1.0 2023-06-27 22:15:48,296 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer_ff3.min_abs, batch_count=1908816.0, ans=0.2 2023-06-27 22:15:58,342 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1908816.0, ans=0.1 2023-06-27 22:15:58,908 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.21 vs. limit=15.0 2023-06-27 22:16:12,336 INFO [train.py:996] (3/4) Epoch 11, batch 13200, loss[loss=0.2393, simple_loss=0.3134, pruned_loss=0.08257, over 21569.00 frames. ], tot_loss[loss=0.2145, simple_loss=0.2959, pruned_loss=0.06657, over 4263997.21 frames. ], batch size: 389, lr: 2.66e-03, grad_scale: 32.0 2023-06-27 22:16:24,930 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=1908876.0, ans=0.0 2023-06-27 22:16:28,480 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-27 22:16:44,659 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1908996.0, ans=0.125 2023-06-27 22:16:49,734 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1908996.0, ans=0.125 2023-06-27 22:16:49,841 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=1908996.0, ans=0.0 2023-06-27 22:17:30,163 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1909116.0, ans=0.1 2023-06-27 22:17:50,825 INFO [train.py:996] (3/4) Epoch 11, batch 13250, loss[loss=0.1969, simple_loss=0.2836, pruned_loss=0.05515, over 21654.00 frames. ], tot_loss[loss=0.2164, simple_loss=0.2958, pruned_loss=0.06854, over 4264167.41 frames. 
], batch size: 263, lr: 2.66e-03, grad_scale: 16.0 2023-06-27 22:17:53,085 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=1909176.0, ans=0.025 2023-06-27 22:17:55,790 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.059e+02 8.358e+02 1.341e+03 1.799e+03 2.954e+03, threshold=2.682e+03, percent-clipped=21.0 2023-06-27 22:18:03,516 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1909176.0, ans=0.125 2023-06-27 22:18:07,202 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.31 vs. limit=15.0 2023-06-27 22:18:11,433 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1909236.0, ans=0.0 2023-06-27 22:18:19,798 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1909236.0, ans=0.0 2023-06-27 22:18:26,317 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1909296.0, ans=0.125 2023-06-27 22:18:36,816 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.69 vs. limit=10.0 2023-06-27 22:18:49,771 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=1909356.0, ans=0.0 2023-06-27 22:18:51,520 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1909356.0, ans=0.0 2023-06-27 22:18:55,574 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=7.71 vs. limit=15.0 2023-06-27 22:19:00,075 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1909356.0, ans=0.1 2023-06-27 22:19:31,562 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1909416.0, ans=0.0 2023-06-27 22:19:34,377 INFO [train.py:996] (3/4) Epoch 11, batch 13300, loss[loss=0.2303, simple_loss=0.3132, pruned_loss=0.07372, over 21594.00 frames. ], tot_loss[loss=0.2169, simple_loss=0.2978, pruned_loss=0.06801, over 4267026.56 frames. ], batch size: 263, lr: 2.66e-03, grad_scale: 16.0 2023-06-27 22:19:35,192 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1909476.0, ans=0.125 2023-06-27 22:19:42,284 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.15 vs. limit=15.0 2023-06-27 22:20:26,591 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=1909596.0, ans=0.0 2023-06-27 22:20:31,072 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.47 vs. 
limit=6.0 2023-06-27 22:20:46,296 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1909656.0, ans=0.125 2023-06-27 22:21:18,947 INFO [train.py:996] (3/4) Epoch 11, batch 13350, loss[loss=0.2571, simple_loss=0.3363, pruned_loss=0.08895, over 21743.00 frames. ], tot_loss[loss=0.2218, simple_loss=0.3018, pruned_loss=0.0709, over 4272019.52 frames. ], batch size: 441, lr: 2.66e-03, grad_scale: 16.0 2023-06-27 22:21:23,517 INFO [scaling.py:962] (3/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=5.59 vs. limit=8.0 2023-06-27 22:21:23,889 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.586e+02 8.555e+02 1.217e+03 1.843e+03 4.034e+03, threshold=2.434e+03, percent-clipped=8.0 2023-06-27 22:23:00,810 INFO [train.py:996] (3/4) Epoch 11, batch 13400, loss[loss=0.2227, simple_loss=0.2991, pruned_loss=0.07319, over 21825.00 frames. ], tot_loss[loss=0.2225, simple_loss=0.3018, pruned_loss=0.07161, over 4279801.00 frames. ], batch size: 351, lr: 2.66e-03, grad_scale: 16.0 2023-06-27 22:23:08,345 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=1910076.0, ans=0.125 2023-06-27 22:23:37,764 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=1910196.0, ans=0.2 2023-06-27 22:24:20,457 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=1910256.0, ans=0.2 2023-06-27 22:24:43,493 INFO [train.py:996] (3/4) Epoch 11, batch 13450, loss[loss=0.1892, simple_loss=0.2641, pruned_loss=0.05721, over 21722.00 frames. ], tot_loss[loss=0.2247, simple_loss=0.3037, pruned_loss=0.07282, over 4281579.53 frames. ], batch size: 282, lr: 2.66e-03, grad_scale: 16.0 2023-06-27 22:24:52,951 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.204e+02 6.376e+02 8.067e+02 1.099e+03 2.577e+03, threshold=1.613e+03, percent-clipped=1.0 2023-06-27 22:26:10,918 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1910616.0, ans=0.1 2023-06-27 22:26:12,640 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=1910616.0, ans=0.2 2023-06-27 22:26:15,855 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1910616.0, ans=0.125 2023-06-27 22:26:31,878 INFO [train.py:996] (3/4) Epoch 11, batch 13500, loss[loss=0.1718, simple_loss=0.2405, pruned_loss=0.05152, over 21596.00 frames. ], tot_loss[loss=0.2168, simple_loss=0.2941, pruned_loss=0.06969, over 4272873.90 frames. ], batch size: 263, lr: 2.66e-03, grad_scale: 16.0 2023-06-27 22:26:39,499 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1910676.0, ans=0.0 2023-06-27 22:26:51,585 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=1910736.0, ans=0.0 2023-06-27 22:26:51,674 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1910736.0, ans=0.125 2023-06-27 22:27:07,657 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.54 vs. 
limit=15.0 2023-06-27 22:27:37,739 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1910856.0, ans=0.1 2023-06-27 22:27:53,800 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1910916.0, ans=0.125 2023-06-27 22:27:57,392 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.79 vs. limit=6.0 2023-06-27 22:28:10,139 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1910976.0, ans=0.0 2023-06-27 22:28:11,315 INFO [train.py:996] (3/4) Epoch 11, batch 13550, loss[loss=0.2237, simple_loss=0.3102, pruned_loss=0.06853, over 21419.00 frames. ], tot_loss[loss=0.2187, simple_loss=0.2984, pruned_loss=0.06948, over 4264558.62 frames. ], batch size: 131, lr: 2.66e-03, grad_scale: 16.0 2023-06-27 22:28:16,103 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.678e+02 8.007e+02 1.277e+03 1.961e+03 4.546e+03, threshold=2.554e+03, percent-clipped=33.0 2023-06-27 22:28:39,862 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=1911036.0, ans=0.125 2023-06-27 22:28:51,123 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=1911036.0, ans=0.0 2023-06-27 22:29:07,783 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1911096.0, ans=0.125 2023-06-27 22:29:53,211 INFO [train.py:996] (3/4) Epoch 11, batch 13600, loss[loss=0.2146, simple_loss=0.2956, pruned_loss=0.06678, over 21834.00 frames. ], tot_loss[loss=0.2189, simple_loss=0.2984, pruned_loss=0.06971, over 4276067.36 frames. ], batch size: 124, lr: 2.66e-03, grad_scale: 32.0 2023-06-27 22:30:02,231 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=1911276.0, ans=0.0 2023-06-27 22:30:05,503 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1911276.0, ans=0.125 2023-06-27 22:30:41,197 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=1911396.0, ans=0.125 2023-06-27 22:30:41,888 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=15.99 vs. limit=22.5 2023-06-27 22:30:44,153 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=1911396.0, ans=0.09899494936611666 2023-06-27 22:30:45,638 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=1911396.0, ans=0.5 2023-06-27 22:30:58,553 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=1911456.0, ans=0.0 2023-06-27 22:31:34,439 INFO [train.py:996] (3/4) Epoch 11, batch 13650, loss[loss=0.1706, simple_loss=0.2455, pruned_loss=0.04786, over 21662.00 frames. ], tot_loss[loss=0.2143, simple_loss=0.2938, pruned_loss=0.06745, over 4273909.49 frames. 
], batch size: 282, lr: 2.66e-03, grad_scale: 16.0 2023-06-27 22:31:36,856 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1911576.0, ans=0.125 2023-06-27 22:31:45,849 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.126e+02 5.497e+02 8.318e+02 1.364e+03 3.376e+03, threshold=1.664e+03, percent-clipped=5.0 2023-06-27 22:31:54,806 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1911636.0, ans=0.0 2023-06-27 22:31:54,829 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1911636.0, ans=0.125 2023-06-27 22:32:49,547 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.38 vs. limit=22.5 2023-06-27 22:33:00,758 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1911816.0, ans=0.125 2023-06-27 22:33:13,812 INFO [train.py:996] (3/4) Epoch 11, batch 13700, loss[loss=0.295, simple_loss=0.3649, pruned_loss=0.1126, over 21507.00 frames. ], tot_loss[loss=0.2135, simple_loss=0.2927, pruned_loss=0.06713, over 4264127.32 frames. ], batch size: 508, lr: 2.66e-03, grad_scale: 16.0 2023-06-27 22:33:53,609 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-27 22:33:58,685 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=1911936.0, ans=0.0 2023-06-27 22:34:17,872 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.68 vs. limit=15.0 2023-06-27 22:34:32,149 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1912056.0, ans=0.1 2023-06-27 22:35:01,857 INFO [train.py:996] (3/4) Epoch 11, batch 13750, loss[loss=0.2165, simple_loss=0.3132, pruned_loss=0.0599, over 21166.00 frames. ], tot_loss[loss=0.2111, simple_loss=0.2898, pruned_loss=0.06621, over 4263978.08 frames. ], batch size: 548, lr: 2.66e-03, grad_scale: 16.0 2023-06-27 22:35:13,331 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.554e+02 7.211e+02 1.142e+03 1.644e+03 3.975e+03, threshold=2.283e+03, percent-clipped=24.0 2023-06-27 22:35:38,439 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1912236.0, ans=0.1 2023-06-27 22:35:46,824 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1912296.0, ans=0.0 2023-06-27 22:36:25,089 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1912416.0, ans=0.125 2023-06-27 22:36:36,826 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1912416.0, ans=0.125 2023-06-27 22:36:39,115 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=8.80 vs. limit=10.0 2023-06-27 22:36:47,740 INFO [train.py:996] (3/4) Epoch 11, batch 13800, loss[loss=0.2385, simple_loss=0.3438, pruned_loss=0.06658, over 21704.00 frames. 
], tot_loss[loss=0.2112, simple_loss=0.2923, pruned_loss=0.06503, over 4257239.17 frames. ], batch size: 298, lr: 2.66e-03, grad_scale: 16.0 2023-06-27 22:36:59,178 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1912476.0, ans=0.1 2023-06-27 22:37:09,450 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1912536.0, ans=0.125 2023-06-27 22:37:57,568 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1912656.0, ans=0.0 2023-06-27 22:38:31,499 INFO [train.py:996] (3/4) Epoch 11, batch 13850, loss[loss=0.2549, simple_loss=0.3373, pruned_loss=0.08625, over 21348.00 frames. ], tot_loss[loss=0.2159, simple_loss=0.3003, pruned_loss=0.0658, over 4257006.97 frames. ], batch size: 548, lr: 2.66e-03, grad_scale: 16.0 2023-06-27 22:38:34,303 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.33 vs. limit=12.0 2023-06-27 22:38:38,141 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.536e+02 6.806e+02 9.243e+02 1.369e+03 3.206e+03, threshold=1.849e+03, percent-clipped=7.0 2023-06-27 22:38:50,229 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=1912776.0, ans=0.0 2023-06-27 22:39:19,428 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=1912896.0, ans=0.125 2023-06-27 22:39:30,727 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=1912956.0, ans=0.2 2023-06-27 22:39:35,654 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=1912956.0, ans=0.125 2023-06-27 22:39:57,855 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-27 22:40:12,061 INFO [train.py:996] (3/4) Epoch 11, batch 13900, loss[loss=0.2223, simple_loss=0.303, pruned_loss=0.07078, over 21449.00 frames. ], tot_loss[loss=0.221, simple_loss=0.304, pruned_loss=0.069, over 4259032.84 frames. ], batch size: 159, lr: 2.66e-03, grad_scale: 16.0 2023-06-27 22:40:12,738 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=1913076.0, ans=0.125 2023-06-27 22:41:49,824 INFO [train.py:996] (3/4) Epoch 11, batch 13950, loss[loss=0.2745, simple_loss=0.3518, pruned_loss=0.09863, over 21882.00 frames. ], tot_loss[loss=0.2227, simple_loss=0.3033, pruned_loss=0.07103, over 4270832.99 frames. ], batch size: 107, lr: 2.66e-03, grad_scale: 16.0 2023-06-27 22:41:57,072 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.13 vs. limit=10.0 2023-06-27 22:42:00,997 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.608e+02 7.245e+02 1.112e+03 1.601e+03 2.924e+03, threshold=2.224e+03, percent-clipped=16.0 2023-06-27 22:42:47,664 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=14.24 vs. 
limit=15.0 2023-06-27 22:42:49,015 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=1913556.0, ans=0.125 2023-06-27 22:42:49,611 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.27 vs. limit=15.0 2023-06-27 22:42:54,828 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1913556.0, ans=0.0 2023-06-27 22:43:20,510 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=1913616.0, ans=0.125 2023-06-27 22:43:29,846 INFO [train.py:996] (3/4) Epoch 11, batch 14000, loss[loss=0.1933, simple_loss=0.2944, pruned_loss=0.04607, over 21793.00 frames. ], tot_loss[loss=0.2193, simple_loss=0.3, pruned_loss=0.06931, over 4270514.63 frames. ], batch size: 282, lr: 2.66e-03, grad_scale: 32.0 2023-06-27 22:43:53,734 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.67 vs. limit=15.0 2023-06-27 22:44:57,219 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=1913916.0, ans=0.0 2023-06-27 22:45:02,053 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-27 22:45:15,873 INFO [train.py:996] (3/4) Epoch 11, batch 14050, loss[loss=0.2053, simple_loss=0.2736, pruned_loss=0.06846, over 21155.00 frames. ], tot_loss[loss=0.2125, simple_loss=0.2941, pruned_loss=0.06549, over 4269112.57 frames. ], batch size: 608, lr: 2.66e-03, grad_scale: 32.0 2023-06-27 22:45:22,437 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.282e+02 6.606e+02 1.013e+03 1.561e+03 3.162e+03, threshold=2.026e+03, percent-clipped=9.0 2023-06-27 22:45:25,067 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.72 vs. limit=6.0 2023-06-27 22:45:38,783 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1914036.0, ans=0.1 2023-06-27 22:45:43,819 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1914036.0, ans=0.125 2023-06-27 22:46:30,849 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=6.33 vs. limit=22.5 2023-06-27 22:46:37,440 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1914216.0, ans=0.0 2023-06-27 22:46:57,111 INFO [train.py:996] (3/4) Epoch 11, batch 14100, loss[loss=0.2051, simple_loss=0.2776, pruned_loss=0.06626, over 21233.00 frames. ], tot_loss[loss=0.2095, simple_loss=0.2881, pruned_loss=0.06539, over 4262579.80 frames. ], batch size: 176, lr: 2.66e-03, grad_scale: 16.0 2023-06-27 22:47:11,537 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.34 vs. limit=15.0 2023-06-27 22:47:17,741 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.73 vs. 
limit=15.0 2023-06-27 22:47:41,340 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1914396.0, ans=0.0 2023-06-27 22:48:15,151 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1914516.0, ans=0.125 2023-06-27 22:48:16,696 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=1914516.0, ans=0.125 2023-06-27 22:48:21,210 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=1914516.0, ans=0.025 2023-06-27 22:48:31,948 INFO [train.py:996] (3/4) Epoch 11, batch 14150, loss[loss=0.1889, simple_loss=0.2775, pruned_loss=0.05013, over 15625.00 frames. ], tot_loss[loss=0.2116, simple_loss=0.2914, pruned_loss=0.06594, over 4261265.63 frames. ], batch size: 60, lr: 2.66e-03, grad_scale: 16.0 2023-06-27 22:48:33,114 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=4.64 vs. limit=15.0 2023-06-27 22:48:44,403 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.197e+02 6.392e+02 8.340e+02 1.310e+03 2.692e+03, threshold=1.668e+03, percent-clipped=6.0 2023-06-27 22:48:44,858 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=1914576.0, ans=0.1 2023-06-27 22:49:02,392 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=1914636.0, ans=0.125 2023-06-27 22:50:06,429 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1914816.0, ans=0.125 2023-06-27 22:50:10,688 INFO [train.py:996] (3/4) Epoch 11, batch 14200, loss[loss=0.1841, simple_loss=0.271, pruned_loss=0.04862, over 21845.00 frames. ], tot_loss[loss=0.2101, simple_loss=0.29, pruned_loss=0.06515, over 4261519.30 frames. ], batch size: 112, lr: 2.66e-03, grad_scale: 16.0 2023-06-27 22:51:01,129 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=1914996.0, ans=0.2 2023-06-27 22:51:20,272 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1915056.0, ans=0.0 2023-06-27 22:51:50,593 INFO [train.py:996] (3/4) Epoch 11, batch 14250, loss[loss=0.1805, simple_loss=0.2491, pruned_loss=0.0559, over 21364.00 frames. ], tot_loss[loss=0.208, simple_loss=0.2856, pruned_loss=0.06519, over 4271435.72 frames. ], batch size: 144, lr: 2.66e-03, grad_scale: 16.0 2023-06-27 22:51:59,420 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.518e+02 7.474e+02 9.961e+02 1.736e+03 2.961e+03, threshold=1.992e+03, percent-clipped=26.0 2023-06-27 22:53:35,434 INFO [train.py:996] (3/4) Epoch 11, batch 14300, loss[loss=0.2146, simple_loss=0.2879, pruned_loss=0.07066, over 15847.00 frames. ], tot_loss[loss=0.2099, simple_loss=0.2889, pruned_loss=0.06545, over 4256026.63 frames. 
], batch size: 66, lr: 2.66e-03, grad_scale: 8.0 2023-06-27 22:53:46,758 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=1915476.0, ans=0.2 2023-06-27 22:54:04,979 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=1915536.0, ans=0.0 2023-06-27 22:54:49,527 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1915656.0, ans=0.125 2023-06-27 22:55:09,181 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=1915716.0, ans=0.2 2023-06-27 22:55:11,010 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1915716.0, ans=0.125 2023-06-27 22:55:18,182 INFO [train.py:996] (3/4) Epoch 11, batch 14350, loss[loss=0.2567, simple_loss=0.3385, pruned_loss=0.0875, over 21614.00 frames. ], tot_loss[loss=0.2132, simple_loss=0.2942, pruned_loss=0.06611, over 4260127.32 frames. ], batch size: 471, lr: 2.66e-03, grad_scale: 8.0 2023-06-27 22:55:27,947 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.575e+02 7.291e+02 1.107e+03 2.245e+03 6.428e+03, threshold=2.214e+03, percent-clipped=28.0 2023-06-27 22:55:48,318 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1915836.0, ans=0.125 2023-06-27 22:56:15,290 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=5.11 vs. limit=15.0 2023-06-27 22:56:56,466 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=1916016.0, ans=0.2 2023-06-27 22:56:59,082 INFO [train.py:996] (3/4) Epoch 11, batch 14400, loss[loss=0.2029, simple_loss=0.2727, pruned_loss=0.06657, over 21475.00 frames. ], tot_loss[loss=0.2132, simple_loss=0.2929, pruned_loss=0.06672, over 4260896.64 frames. ], batch size: 548, lr: 2.66e-03, grad_scale: 16.0 2023-06-27 22:57:37,592 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1916196.0, ans=0.125 2023-06-27 22:58:06,324 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1916256.0, ans=0.125 2023-06-27 22:58:40,444 INFO [train.py:996] (3/4) Epoch 11, batch 14450, loss[loss=0.1808, simple_loss=0.2536, pruned_loss=0.05403, over 21710.00 frames. ], tot_loss[loss=0.2102, simple_loss=0.2874, pruned_loss=0.06646, over 4266919.74 frames. ], batch size: 247, lr: 2.66e-03, grad_scale: 16.0 2023-06-27 22:58:47,700 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=1916376.0, ans=0.95 2023-06-27 22:58:50,301 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.736e+02 6.824e+02 1.004e+03 1.771e+03 3.739e+03, threshold=2.008e+03, percent-clipped=15.0 2023-06-27 22:59:10,632 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1916436.0, ans=0.1 2023-06-27 23:00:21,610 INFO [train.py:996] (3/4) Epoch 11, batch 14500, loss[loss=0.2074, simple_loss=0.2964, pruned_loss=0.05921, over 21780.00 frames. 
], tot_loss[loss=0.2072, simple_loss=0.2829, pruned_loss=0.06577, over 4275727.45 frames. ], batch size: 351, lr: 2.66e-03, grad_scale: 16.0 2023-06-27 23:00:32,281 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.min_positive, batch_count=1916676.0, ans=0.05 2023-06-27 23:00:46,268 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=1916736.0, ans=0.125 2023-06-27 23:01:05,602 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1916796.0, ans=0.1 2023-06-27 23:02:01,960 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1916916.0, ans=0.125 2023-06-27 23:02:04,590 INFO [train.py:996] (3/4) Epoch 11, batch 14550, loss[loss=0.2361, simple_loss=0.3145, pruned_loss=0.07885, over 21418.00 frames. ], tot_loss[loss=0.2112, simple_loss=0.2883, pruned_loss=0.06704, over 4274651.81 frames. ], batch size: 211, lr: 2.66e-03, grad_scale: 16.0 2023-06-27 23:02:14,903 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.488e+02 6.849e+02 9.219e+02 1.443e+03 4.541e+03, threshold=1.844e+03, percent-clipped=15.0 2023-06-27 23:02:23,902 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=1916976.0, ans=0.0 2023-06-27 23:02:58,472 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1917096.0, ans=0.0 2023-06-27 23:03:30,844 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=1917216.0, ans=0.05 2023-06-27 23:03:35,784 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-27 23:03:41,377 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.51 vs. limit=15.0 2023-06-27 23:03:48,448 INFO [train.py:996] (3/4) Epoch 11, batch 14600, loss[loss=0.1889, simple_loss=0.2473, pruned_loss=0.06524, over 20951.00 frames. ], tot_loss[loss=0.2171, simple_loss=0.2944, pruned_loss=0.06997, over 4273737.30 frames. ], batch size: 608, lr: 2.65e-03, grad_scale: 16.0 2023-06-27 23:04:28,581 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1917336.0, ans=0.125 2023-06-27 23:04:48,379 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1917396.0, ans=0.125 2023-06-27 23:05:03,486 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1917456.0, ans=0.0 2023-06-27 23:05:20,701 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.46 vs. limit=15.0 2023-06-27 23:05:25,077 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.min_positive, batch_count=1917516.0, ans=0.025 2023-06-27 23:05:31,380 INFO [train.py:996] (3/4) Epoch 11, batch 14650, loss[loss=0.2619, simple_loss=0.3303, pruned_loss=0.09677, over 21367.00 frames. ], tot_loss[loss=0.2165, simple_loss=0.2953, pruned_loss=0.06885, over 4274985.93 frames. 
], batch size: 549, lr: 2.65e-03, grad_scale: 16.0 2023-06-27 23:05:45,911 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.373e+02 8.214e+02 1.374e+03 1.981e+03 3.761e+03, threshold=2.748e+03, percent-clipped=28.0 2023-06-27 23:05:50,192 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=1917576.0, ans=0.125 2023-06-27 23:06:36,047 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1917696.0, ans=0.125 2023-06-27 23:06:46,839 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.55 vs. limit=15.0 2023-06-27 23:06:56,934 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=1917816.0, ans=0.125 2023-06-27 23:07:19,686 INFO [train.py:996] (3/4) Epoch 11, batch 14700, loss[loss=0.2361, simple_loss=0.3325, pruned_loss=0.06982, over 21714.00 frames. ], tot_loss[loss=0.2089, simple_loss=0.2898, pruned_loss=0.06396, over 4269419.71 frames. ], batch size: 351, lr: 2.65e-03, grad_scale: 16.0 2023-06-27 23:07:48,813 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1917936.0, ans=0.0 2023-06-27 23:08:11,774 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=1917996.0, ans=0.125 2023-06-27 23:09:04,341 INFO [train.py:996] (3/4) Epoch 11, batch 14750, loss[loss=0.2602, simple_loss=0.335, pruned_loss=0.09268, over 21624.00 frames. ], tot_loss[loss=0.2132, simple_loss=0.2937, pruned_loss=0.06638, over 4254338.53 frames. ], batch size: 389, lr: 2.65e-03, grad_scale: 16.0 2023-06-27 23:09:14,865 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.274e+02 6.652e+02 9.504e+02 1.333e+03 3.432e+03, threshold=1.901e+03, percent-clipped=1.0 2023-06-27 23:09:44,666 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=1918236.0, ans=0.07 2023-06-27 23:10:03,473 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.74 vs. limit=15.0 2023-06-27 23:10:26,145 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1918356.0, ans=0.125 2023-06-27 23:10:27,699 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=1918356.0, ans=0.2 2023-06-27 23:10:40,779 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1918416.0, ans=0.0 2023-06-27 23:10:48,844 INFO [train.py:996] (3/4) Epoch 11, batch 14800, loss[loss=0.2168, simple_loss=0.2981, pruned_loss=0.06778, over 21572.00 frames. ], tot_loss[loss=0.223, simple_loss=0.3041, pruned_loss=0.07089, over 4254496.31 frames. 
], batch size: 263, lr: 2.65e-03, grad_scale: 32.0 2023-06-27 23:10:55,569 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1918476.0, ans=0.0 2023-06-27 23:10:58,715 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=1918476.0, ans=0.125 2023-06-27 23:11:24,886 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1918536.0, ans=0.0 2023-06-27 23:11:36,937 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=1918596.0, ans=0.2 2023-06-27 23:12:14,463 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1918716.0, ans=0.1 2023-06-27 23:12:43,483 INFO [train.py:996] (3/4) Epoch 11, batch 14850, loss[loss=0.1722, simple_loss=0.2444, pruned_loss=0.04996, over 21536.00 frames. ], tot_loss[loss=0.2191, simple_loss=0.2979, pruned_loss=0.07021, over 4257073.77 frames. ], batch size: 263, lr: 2.65e-03, grad_scale: 16.0 2023-06-27 23:13:00,843 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.760e+02 8.785e+02 1.252e+03 1.775e+03 4.444e+03, threshold=2.503e+03, percent-clipped=22.0 2023-06-27 23:13:01,517 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=1918776.0, ans=0.0 2023-06-27 23:13:24,663 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1918896.0, ans=0.1 2023-06-27 23:14:16,363 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1919016.0, ans=0.0 2023-06-27 23:14:26,118 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=1919016.0, ans=0.125 2023-06-27 23:14:32,370 INFO [train.py:996] (3/4) Epoch 11, batch 14900, loss[loss=0.265, simple_loss=0.3377, pruned_loss=0.09611, over 21470.00 frames. ], tot_loss[loss=0.223, simple_loss=0.3017, pruned_loss=0.07217, over 4264983.31 frames. ], batch size: 471, lr: 2.65e-03, grad_scale: 16.0 2023-06-27 23:14:34,702 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1919076.0, ans=0.0 2023-06-27 23:14:42,555 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.42 vs. limit=15.0 2023-06-27 23:15:21,063 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=1919196.0, ans=0.125 2023-06-27 23:15:22,582 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=1919196.0, ans=0.0 2023-06-27 23:15:45,255 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1919256.0, ans=0.125 2023-06-27 23:16:04,731 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1919316.0, ans=0.125 2023-06-27 23:16:16,122 INFO [train.py:996] (3/4) Epoch 11, batch 14950, loss[loss=0.1957, simple_loss=0.2891, pruned_loss=0.05113, over 21865.00 frames. ], tot_loss[loss=0.2245, simple_loss=0.3038, pruned_loss=0.07253, over 4261339.09 frames. 
], batch size: 372, lr: 2.65e-03, grad_scale: 16.0 2023-06-27 23:16:23,492 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=1919376.0, ans=0.125 2023-06-27 23:16:26,866 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1919376.0, ans=0.0 2023-06-27 23:16:27,765 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.435e+02 7.906e+02 1.198e+03 1.645e+03 4.202e+03, threshold=2.397e+03, percent-clipped=8.0 2023-06-27 23:17:09,818 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=1919496.0, ans=0.0 2023-06-27 23:17:11,757 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1919496.0, ans=0.125 2023-06-27 23:17:26,030 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=1919556.0, ans=0.2 2023-06-27 23:17:40,724 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1919616.0, ans=0.0 2023-06-27 23:17:58,254 INFO [train.py:996] (3/4) Epoch 11, batch 15000, loss[loss=0.2083, simple_loss=0.2761, pruned_loss=0.07028, over 21804.00 frames. ], tot_loss[loss=0.2264, simple_loss=0.305, pruned_loss=0.07391, over 4259077.50 frames. ], batch size: 247, lr: 2.65e-03, grad_scale: 16.0 2023-06-27 23:17:58,254 INFO [train.py:1019] (3/4) Computing validation loss 2023-06-27 23:18:18,454 INFO [train.py:1028] (3/4) Epoch 11, validation: loss=0.2534, simple_loss=0.3437, pruned_loss=0.08155, over 1796401.00 frames. 2023-06-27 23:18:18,455 INFO [train.py:1029] (3/4) Maximum memory allocated so far is 23690MB 2023-06-27 23:18:25,156 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.89 vs. limit=6.0 2023-06-27 23:18:27,098 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.66 vs. limit=10.0 2023-06-27 23:19:42,202 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1919916.0, ans=0.125 2023-06-27 23:20:03,425 INFO [train.py:996] (3/4) Epoch 11, batch 15050, loss[loss=0.2531, simple_loss=0.3524, pruned_loss=0.07691, over 21261.00 frames. ], tot_loss[loss=0.2286, simple_loss=0.3071, pruned_loss=0.07505, over 4266504.27 frames. ], batch size: 548, lr: 2.65e-03, grad_scale: 16.0 2023-06-27 23:20:17,264 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.821e+02 6.928e+02 9.435e+02 1.433e+03 3.639e+03, threshold=1.887e+03, percent-clipped=3.0 2023-06-27 23:21:49,508 INFO [train.py:996] (3/4) Epoch 11, batch 15100, loss[loss=0.2291, simple_loss=0.3123, pruned_loss=0.07297, over 21263.00 frames. ], tot_loss[loss=0.2277, simple_loss=0.3069, pruned_loss=0.07427, over 4259963.57 frames. 
], batch size: 548, lr: 2.65e-03, grad_scale: 16.0 2023-06-27 23:21:57,256 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2.whitening_limit, batch_count=1920276.0, ans=15.0 2023-06-27 23:22:48,211 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=1920396.0, ans=0.125 2023-06-27 23:23:12,857 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=1920516.0, ans=0.2 2023-06-27 23:23:18,426 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.15 vs. limit=12.0 2023-06-27 23:23:23,624 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=3.92 vs. limit=15.0 2023-06-27 23:23:26,061 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1920516.0, ans=0.125 2023-06-27 23:23:30,439 INFO [train.py:996] (3/4) Epoch 11, batch 15150, loss[loss=0.1978, simple_loss=0.2654, pruned_loss=0.06504, over 21742.00 frames. ], tot_loss[loss=0.2261, simple_loss=0.3038, pruned_loss=0.07419, over 4252297.16 frames. ], batch size: 334, lr: 2.65e-03, grad_scale: 16.0 2023-06-27 23:23:46,528 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.704e+02 7.406e+02 1.033e+03 1.604e+03 3.709e+03, threshold=2.066e+03, percent-clipped=14.0 2023-06-27 23:24:05,231 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1920636.0, ans=0.125 2023-06-27 23:24:13,423 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1920636.0, ans=0.0 2023-06-27 23:24:25,605 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.78 vs. limit=15.0 2023-06-27 23:25:12,971 INFO [train.py:996] (3/4) Epoch 11, batch 15200, loss[loss=0.2366, simple_loss=0.3362, pruned_loss=0.06854, over 20786.00 frames. ], tot_loss[loss=0.2177, simple_loss=0.295, pruned_loss=0.07017, over 4261856.60 frames. ], batch size: 607, lr: 2.65e-03, grad_scale: 32.0 2023-06-27 23:25:25,869 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=13.32 vs. limit=22.5 2023-06-27 23:25:40,124 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=1920936.0, ans=0.0 2023-06-27 23:25:42,448 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.66 vs. limit=12.0 2023-06-27 23:25:49,148 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=12.43 vs. limit=15.0 2023-06-27 23:26:43,707 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=1921116.0, ans=0.0 2023-06-27 23:26:58,286 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=1921116.0, ans=0.07 2023-06-27 23:27:01,184 INFO [train.py:996] (3/4) Epoch 11, batch 15250, loss[loss=0.2333, simple_loss=0.2867, pruned_loss=0.08997, over 21231.00 frames. 
], tot_loss[loss=0.2137, simple_loss=0.2893, pruned_loss=0.06907, over 4252389.52 frames. ], batch size: 471, lr: 2.65e-03, grad_scale: 16.0 2023-06-27 23:27:23,408 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.342e+02 7.886e+02 1.142e+03 1.659e+03 3.992e+03, threshold=2.285e+03, percent-clipped=18.0 2023-06-27 23:27:24,194 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=1921176.0, ans=0.2 2023-06-27 23:28:13,536 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=2.88 vs. limit=12.0 2023-06-27 23:28:42,799 INFO [train.py:996] (3/4) Epoch 11, batch 15300, loss[loss=0.2952, simple_loss=0.3376, pruned_loss=0.1264, over 21301.00 frames. ], tot_loss[loss=0.2162, simple_loss=0.2904, pruned_loss=0.07102, over 4259136.75 frames. ], batch size: 507, lr: 2.65e-03, grad_scale: 16.0 2023-06-27 23:28:52,076 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=13.84 vs. limit=15.0 2023-06-27 23:28:53,310 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.90 vs. limit=15.0 2023-06-27 23:29:07,292 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1921536.0, ans=0.125 2023-06-27 23:30:08,304 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.80 vs. limit=22.5 2023-06-27 23:30:29,665 INFO [train.py:996] (3/4) Epoch 11, batch 15350, loss[loss=0.2246, simple_loss=0.3274, pruned_loss=0.06092, over 21634.00 frames. ], tot_loss[loss=0.2215, simple_loss=0.2965, pruned_loss=0.07329, over 4264195.04 frames. ], batch size: 263, lr: 2.65e-03, grad_scale: 16.0 2023-06-27 23:30:45,524 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=4.75 vs. limit=15.0 2023-06-27 23:30:47,340 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.758e+02 7.708e+02 1.113e+03 1.589e+03 3.642e+03, threshold=2.225e+03, percent-clipped=6.0 2023-06-27 23:30:47,827 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1921776.0, ans=0.0 2023-06-27 23:30:53,099 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1921836.0, ans=0.0 2023-06-27 23:31:31,931 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=1921956.0, ans=0.125 2023-06-27 23:32:05,687 INFO [train.py:996] (3/4) Epoch 11, batch 15400, loss[loss=0.2411, simple_loss=0.3124, pruned_loss=0.08485, over 21833.00 frames. ], tot_loss[loss=0.2204, simple_loss=0.2971, pruned_loss=0.07179, over 4269383.50 frames. 
], batch size: 414, lr: 2.65e-03, grad_scale: 16.0 2023-06-27 23:32:20,584 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1922076.0, ans=0.0 2023-06-27 23:32:33,807 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten.whitening_limit, batch_count=1922136.0, ans=15.0 2023-06-27 23:33:11,213 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1922256.0, ans=0.125 2023-06-27 23:33:47,726 INFO [train.py:996] (3/4) Epoch 11, batch 15450, loss[loss=0.207, simple_loss=0.3057, pruned_loss=0.05417, over 21860.00 frames. ], tot_loss[loss=0.218, simple_loss=0.2945, pruned_loss=0.0708, over 4277602.02 frames. ], batch size: 351, lr: 2.65e-03, grad_scale: 16.0 2023-06-27 23:34:10,727 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.367e+02 6.968e+02 9.606e+02 1.449e+03 2.613e+03, threshold=1.921e+03, percent-clipped=5.0 2023-06-27 23:34:23,253 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=1922436.0, ans=0.125 2023-06-27 23:34:49,367 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1922496.0, ans=0.125 2023-06-27 23:34:52,590 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1922556.0, ans=0.125 2023-06-27 23:35:01,361 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.20 vs. limit=22.5 2023-06-27 23:35:03,843 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=1922556.0, ans=0.125 2023-06-27 23:35:17,274 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1922616.0, ans=0.125 2023-06-27 23:35:34,400 INFO [train.py:996] (3/4) Epoch 11, batch 15500, loss[loss=0.1978, simple_loss=0.2747, pruned_loss=0.06044, over 21107.00 frames. ], tot_loss[loss=0.2192, simple_loss=0.298, pruned_loss=0.07015, over 4269358.21 frames. ], batch size: 608, lr: 2.65e-03, grad_scale: 16.0 2023-06-27 23:35:44,613 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=1922676.0, ans=0.125 2023-06-27 23:36:41,498 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1922856.0, ans=0.125 2023-06-27 23:37:21,916 INFO [train.py:996] (3/4) Epoch 11, batch 15550, loss[loss=0.1976, simple_loss=0.2923, pruned_loss=0.05148, over 21789.00 frames. ], tot_loss[loss=0.2155, simple_loss=0.2957, pruned_loss=0.06766, over 4255327.93 frames. 
], batch size: 371, lr: 2.65e-03, grad_scale: 16.0 2023-06-27 23:37:30,943 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=1922976.0, ans=0.2 2023-06-27 23:37:34,970 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.488e+02 6.660e+02 9.717e+02 1.306e+03 2.635e+03, threshold=1.943e+03, percent-clipped=6.0 2023-06-27 23:38:02,188 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=1923096.0, ans=0.125 2023-06-27 23:38:42,736 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=4.84 vs. limit=15.0 2023-06-27 23:39:03,933 INFO [train.py:996] (3/4) Epoch 11, batch 15600, loss[loss=0.2096, simple_loss=0.2857, pruned_loss=0.06677, over 21612.00 frames. ], tot_loss[loss=0.2112, simple_loss=0.2897, pruned_loss=0.06638, over 4254642.00 frames. ], batch size: 298, lr: 2.65e-03, grad_scale: 32.0 2023-06-27 23:39:04,542 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1923276.0, ans=0.125 2023-06-27 23:39:38,523 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=1923336.0, ans=0.2 2023-06-27 23:39:58,846 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.14 vs. limit=6.0 2023-06-27 23:40:00,454 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.57 vs. limit=22.5 2023-06-27 23:40:42,323 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-27 23:40:45,241 INFO [train.py:996] (3/4) Epoch 11, batch 15650, loss[loss=0.2444, simple_loss=0.2889, pruned_loss=0.0999, over 21345.00 frames. ], tot_loss[loss=0.2112, simple_loss=0.2894, pruned_loss=0.06645, over 4255448.80 frames. ], batch size: 508, lr: 2.65e-03, grad_scale: 32.0 2023-06-27 23:41:03,338 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.415e+02 8.795e+02 1.290e+03 1.896e+03 3.786e+03, threshold=2.580e+03, percent-clipped=24.0 2023-06-27 23:41:07,235 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1923636.0, ans=0.125 2023-06-27 23:41:20,425 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=1923636.0, ans=0.035 2023-06-27 23:41:20,479 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=1923636.0, ans=0.125 2023-06-27 23:41:53,248 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=1923756.0, ans=0.125 2023-06-27 23:42:27,304 INFO [train.py:996] (3/4) Epoch 11, batch 15700, loss[loss=0.2138, simple_loss=0.2822, pruned_loss=0.07273, over 21445.00 frames. ], tot_loss[loss=0.2093, simple_loss=0.2866, pruned_loss=0.066, over 4247175.80 frames. 
], batch size: 441, lr: 2.65e-03, grad_scale: 16.0 2023-06-27 23:42:45,984 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1923876.0, ans=0.125 2023-06-27 23:44:07,111 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=1924176.0, ans=0.05 2023-06-27 23:44:08,200 INFO [train.py:996] (3/4) Epoch 11, batch 15750, loss[loss=0.1814, simple_loss=0.2539, pruned_loss=0.05445, over 21783.00 frames. ], tot_loss[loss=0.2066, simple_loss=0.2824, pruned_loss=0.06541, over 4249669.53 frames. ], batch size: 118, lr: 2.65e-03, grad_scale: 16.0 2023-06-27 23:44:27,426 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.257e+02 5.974e+02 8.242e+02 1.132e+03 2.648e+03, threshold=1.648e+03, percent-clipped=1.0 2023-06-27 23:44:35,150 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1924236.0, ans=0.125 2023-06-27 23:45:49,092 INFO [train.py:996] (3/4) Epoch 11, batch 15800, loss[loss=0.1877, simple_loss=0.2563, pruned_loss=0.05957, over 21615.00 frames. ], tot_loss[loss=0.2039, simple_loss=0.2782, pruned_loss=0.06486, over 4260211.99 frames. ], batch size: 332, lr: 2.65e-03, grad_scale: 16.0 2023-06-27 23:46:19,534 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=1924536.0, ans=0.2 2023-06-27 23:46:30,347 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1.whitening_limit, batch_count=1924596.0, ans=10.0 2023-06-27 23:47:32,267 INFO [train.py:996] (3/4) Epoch 11, batch 15850, loss[loss=0.2424, simple_loss=0.3076, pruned_loss=0.08857, over 21703.00 frames. ], tot_loss[loss=0.2073, simple_loss=0.2805, pruned_loss=0.06707, over 4262923.94 frames. ], batch size: 441, lr: 2.65e-03, grad_scale: 16.0 2023-06-27 23:47:49,655 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=1924776.0, ans=0.2 2023-06-27 23:47:52,234 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.544e+02 6.551e+02 9.403e+02 1.336e+03 2.589e+03, threshold=1.881e+03, percent-clipped=10.0 2023-06-27 23:47:56,345 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.67 vs. limit=15.0 2023-06-27 23:48:40,530 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1924956.0, ans=0.0 2023-06-27 23:48:41,991 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=1924956.0, ans=0.0 2023-06-27 23:49:15,200 INFO [train.py:996] (3/4) Epoch 11, batch 15900, loss[loss=0.188, simple_loss=0.2556, pruned_loss=0.06022, over 21521.00 frames. ], tot_loss[loss=0.205, simple_loss=0.2771, pruned_loss=0.06647, over 4262091.17 frames. 
], batch size: 230, lr: 2.65e-03, grad_scale: 16.0 2023-06-27 23:49:15,994 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1925076.0, ans=0.0 2023-06-27 23:49:47,013 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=1925136.0, ans=0.125 2023-06-27 23:50:17,011 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1925256.0, ans=0.125 2023-06-27 23:50:20,265 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1925256.0, ans=0.0 2023-06-27 23:50:26,918 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1925256.0, ans=0.0 2023-06-27 23:50:35,293 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1925316.0, ans=0.125 2023-06-27 23:50:36,755 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-27 23:50:57,571 INFO [train.py:996] (3/4) Epoch 11, batch 15950, loss[loss=0.2016, simple_loss=0.2898, pruned_loss=0.05671, over 21376.00 frames. ], tot_loss[loss=0.2041, simple_loss=0.2775, pruned_loss=0.06537, over 4249485.66 frames. ], batch size: 211, lr: 2.65e-03, grad_scale: 16.0 2023-06-27 23:51:17,432 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.351e+02 7.372e+02 1.063e+03 1.688e+03 3.100e+03, threshold=2.125e+03, percent-clipped=16.0 2023-06-27 23:51:38,123 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=1925496.0, ans=0.125 2023-06-27 23:51:40,013 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1925496.0, ans=0.0 2023-06-27 23:52:13,140 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=1925616.0, ans=10.0 2023-06-27 23:52:29,246 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=1925616.0, ans=0.09899494936611666 2023-06-27 23:52:29,272 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=1925616.0, ans=0.2 2023-06-27 23:52:38,998 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=1925676.0, ans=0.2 2023-06-27 23:52:40,049 INFO [train.py:996] (3/4) Epoch 11, batch 16000, loss[loss=0.1989, simple_loss=0.2953, pruned_loss=0.05126, over 21757.00 frames. ], tot_loss[loss=0.2024, simple_loss=0.279, pruned_loss=0.06293, over 4254812.18 frames. ], batch size: 298, lr: 2.65e-03, grad_scale: 32.0 2023-06-27 23:52:44,489 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1925676.0, ans=0.2 2023-06-27 23:52:50,657 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=1925676.0, ans=0.2 2023-06-27 23:54:02,715 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=5.60 vs. 
limit=10.0 2023-06-27 23:54:17,636 INFO [train.py:996] (3/4) Epoch 11, batch 16050, loss[loss=0.2014, simple_loss=0.3001, pruned_loss=0.05131, over 21713.00 frames. ], tot_loss[loss=0.2036, simple_loss=0.2839, pruned_loss=0.06167, over 4264990.47 frames. ], batch size: 247, lr: 2.65e-03, grad_scale: 16.0 2023-06-27 23:54:18,225 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1925976.0, ans=0.1 2023-06-27 23:54:35,262 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=1925976.0, ans=0.125 2023-06-27 23:54:43,011 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.281e+02 6.057e+02 9.389e+02 1.429e+03 3.235e+03, threshold=1.878e+03, percent-clipped=6.0 2023-06-27 23:54:49,823 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1926036.0, ans=0.125 2023-06-27 23:54:59,627 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1926096.0, ans=0.0 2023-06-27 23:55:54,185 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=1926216.0, ans=0.125 2023-06-27 23:55:56,282 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.47 vs. limit=22.5 2023-06-27 23:55:56,782 INFO [train.py:996] (3/4) Epoch 11, batch 16100, loss[loss=0.2477, simple_loss=0.3104, pruned_loss=0.09245, over 21802.00 frames. ], tot_loss[loss=0.2065, simple_loss=0.2877, pruned_loss=0.06261, over 4272750.41 frames. ], batch size: 508, lr: 2.65e-03, grad_scale: 16.0 2023-06-27 23:56:13,487 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1926276.0, ans=0.1 2023-06-27 23:56:41,247 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.14 vs. limit=15.0 2023-06-27 23:57:08,124 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.44 vs. limit=15.0 2023-06-27 23:57:23,829 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=1926516.0, ans=0.0 2023-06-27 23:57:25,754 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1926516.0, ans=0.1 2023-06-27 23:57:35,122 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1926516.0, ans=0.0 2023-06-27 23:57:37,660 INFO [train.py:996] (3/4) Epoch 11, batch 16150, loss[loss=0.1998, simple_loss=0.2705, pruned_loss=0.06453, over 20058.00 frames. ], tot_loss[loss=0.2076, simple_loss=0.287, pruned_loss=0.0641, over 4281412.50 frames. 
], batch size: 702, lr: 2.65e-03, grad_scale: 16.0 2023-06-27 23:58:02,681 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-27 23:58:03,757 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.374e+02 7.210e+02 1.100e+03 1.545e+03 2.941e+03, threshold=2.200e+03, percent-clipped=14.0 2023-06-27 23:58:08,125 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1926636.0, ans=0.1 2023-06-27 23:59:01,121 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.11 vs. limit=6.0 2023-06-27 23:59:19,445 INFO [train.py:996] (3/4) Epoch 11, batch 16200, loss[loss=0.2697, simple_loss=0.3434, pruned_loss=0.09805, over 21475.00 frames. ], tot_loss[loss=0.2117, simple_loss=0.2913, pruned_loss=0.06601, over 4283774.18 frames. ], batch size: 131, lr: 2.65e-03, grad_scale: 16.0 2023-06-28 00:00:35,875 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1927056.0, ans=0.0 2023-06-28 00:01:06,273 INFO [train.py:996] (3/4) Epoch 11, batch 16250, loss[loss=0.1994, simple_loss=0.2749, pruned_loss=0.06197, over 21683.00 frames. ], tot_loss[loss=0.2121, simple_loss=0.2909, pruned_loss=0.06665, over 4279344.61 frames. ], batch size: 298, lr: 2.65e-03, grad_scale: 16.0 2023-06-28 00:01:27,646 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.408e+02 8.189e+02 1.172e+03 1.830e+03 4.029e+03, threshold=2.343e+03, percent-clipped=14.0 2023-06-28 00:01:52,251 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.61 vs. limit=22.5 2023-06-28 00:02:14,711 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=1927356.0, ans=0.2 2023-06-28 00:02:42,237 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1927416.0, ans=0.125 2023-06-28 00:02:53,055 INFO [train.py:996] (3/4) Epoch 11, batch 16300, loss[loss=0.1832, simple_loss=0.2776, pruned_loss=0.04441, over 21208.00 frames. ], tot_loss[loss=0.2067, simple_loss=0.2857, pruned_loss=0.06381, over 4269266.12 frames. ], batch size: 548, lr: 2.65e-03, grad_scale: 16.0 2023-06-28 00:03:40,200 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=1927596.0, ans=0.125 2023-06-28 00:04:22,729 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=1927716.0, ans=0.125 2023-06-28 00:04:32,775 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-28 00:04:37,032 INFO [train.py:996] (3/4) Epoch 11, batch 16350, loss[loss=0.2122, simple_loss=0.3046, pruned_loss=0.05991, over 20791.00 frames. ], tot_loss[loss=0.2065, simple_loss=0.285, pruned_loss=0.06401, over 4269164.56 frames. 
], batch size: 608, lr: 2.65e-03, grad_scale: 16.0 2023-06-28 00:04:53,539 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.626e+02 5.951e+02 8.785e+02 1.347e+03 2.273e+03, threshold=1.757e+03, percent-clipped=0.0 2023-06-28 00:05:21,279 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.min_abs, batch_count=1927896.0, ans=0.5 2023-06-28 00:05:28,158 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=1927896.0, ans=0.2 2023-06-28 00:05:46,438 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.67 vs. limit=12.0 2023-06-28 00:06:15,110 INFO [train.py:996] (3/4) Epoch 11, batch 16400, loss[loss=0.2012, simple_loss=0.278, pruned_loss=0.06223, over 21883.00 frames. ], tot_loss[loss=0.211, simple_loss=0.2901, pruned_loss=0.06592, over 4274896.65 frames. ], batch size: 107, lr: 2.65e-03, grad_scale: 32.0 2023-06-28 00:07:22,256 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1928256.0, ans=0.0 2023-06-28 00:07:56,945 INFO [train.py:996] (3/4) Epoch 11, batch 16450, loss[loss=0.2535, simple_loss=0.3162, pruned_loss=0.0954, over 21754.00 frames. ], tot_loss[loss=0.2115, simple_loss=0.2903, pruned_loss=0.06634, over 4274217.07 frames. ], batch size: 508, lr: 2.65e-03, grad_scale: 16.0 2023-06-28 00:08:11,797 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=1928376.0, ans=0.2 2023-06-28 00:08:16,237 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.401e+02 6.665e+02 9.796e+02 1.595e+03 2.942e+03, threshold=1.959e+03, percent-clipped=15.0 2023-06-28 00:08:19,186 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.03 vs. limit=12.0 2023-06-28 00:08:42,332 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1928496.0, ans=0.125 2023-06-28 00:09:10,593 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1928556.0, ans=0.125 2023-06-28 00:09:41,706 INFO [train.py:996] (3/4) Epoch 11, batch 16500, loss[loss=0.2715, simple_loss=0.3545, pruned_loss=0.09424, over 20007.00 frames. ], tot_loss[loss=0.2116, simple_loss=0.2897, pruned_loss=0.0668, over 4276420.12 frames. 
], batch size: 702, lr: 2.65e-03, grad_scale: 16.0 2023-06-28 00:09:44,099 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=1928676.0, ans=0.0 2023-06-28 00:09:45,774 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-28 00:09:47,429 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=1928676.0, ans=0.125 2023-06-28 00:11:09,996 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1928916.0, ans=0.0 2023-06-28 00:11:14,722 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1928916.0, ans=0.125 2023-06-28 00:11:23,870 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=1928916.0, ans=0.125 2023-06-28 00:11:26,495 INFO [train.py:996] (3/4) Epoch 11, batch 16550, loss[loss=0.1827, simple_loss=0.2524, pruned_loss=0.05654, over 21834.00 frames. ], tot_loss[loss=0.2104, simple_loss=0.2897, pruned_loss=0.06554, over 4281275.87 frames. ], batch size: 118, lr: 2.65e-03, grad_scale: 16.0 2023-06-28 00:11:27,031 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1928976.0, ans=0.2 2023-06-28 00:11:35,520 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten.whitening_limit, batch_count=1928976.0, ans=22.5 2023-06-28 00:11:50,015 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.361e+02 7.345e+02 1.277e+03 1.917e+03 4.181e+03, threshold=2.555e+03, percent-clipped=23.0 2023-06-28 00:13:15,385 INFO [train.py:996] (3/4) Epoch 11, batch 16600, loss[loss=0.2673, simple_loss=0.3854, pruned_loss=0.0746, over 19690.00 frames. ], tot_loss[loss=0.2163, simple_loss=0.2968, pruned_loss=0.06787, over 4277938.15 frames. ], batch size: 702, lr: 2.65e-03, grad_scale: 16.0 2023-06-28 00:13:19,605 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1929276.0, ans=0.125 2023-06-28 00:13:37,839 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1929336.0, ans=0.125 2023-06-28 00:13:57,208 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=1929396.0, ans=0.015 2023-06-28 00:14:19,849 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.71 vs. limit=12.0 2023-06-28 00:14:55,352 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1929516.0, ans=0.125 2023-06-28 00:14:59,141 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1929576.0, ans=0.0 2023-06-28 00:15:00,060 INFO [train.py:996] (3/4) Epoch 11, batch 16650, loss[loss=0.2321, simple_loss=0.3112, pruned_loss=0.07649, over 21384.00 frames. ], tot_loss[loss=0.2236, simple_loss=0.3066, pruned_loss=0.07025, over 4275130.01 frames. 
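The scaling.py:182 entries record ScheduledFloat values (dropout probabilities, skip rates, balancer limits) being re-evaluated at the current batch_count. A small sketch of a batch-count-dependent float with piecewise-linear interpolation between breakpoints; the class name and the example schedule are assumptions for illustration, not the actual implementation:

```python
class ScheduledValue:
    """A float that depends on the training batch count, interpolated
    piecewise-linearly between (batch_count, value) breakpoints."""
    def __init__(self, *points):
        self.points = sorted(points)           # e.g. (0.0, 0.1), (20000.0, 0.0)

    def value(self, batch_count: float) -> float:
        pts = self.points
        if batch_count <= pts[0][0]:
            return pts[0][1]
        for (x0, y0), (x1, y1) in zip(pts[:-1], pts[1:]):
            if batch_count <= x1:
                return y0 + (batch_count - x0) / (x1 - x0) * (y1 - y0)
        return pts[-1][1]

# A skip rate that ramps from 0.1 down to 0.0 over the first 20k batches;
# by the batch counts in the log (~1.93e6) it has long since reached 0.0.
conv_skip_rate = ScheduledValue((0.0, 0.1), (20000.0, 0.0))
print(conv_skip_rate.value(1928676.0))         # -> 0.0
```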
], batch size: 549, lr: 2.65e-03, grad_scale: 16.0 2023-06-28 00:15:15,795 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1929576.0, ans=0.125 2023-06-28 00:15:28,782 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.761e+02 8.064e+02 1.116e+03 1.585e+03 3.216e+03, threshold=2.231e+03, percent-clipped=5.0 2023-06-28 00:16:22,006 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1929756.0, ans=0.125 2023-06-28 00:16:50,149 INFO [train.py:996] (3/4) Epoch 11, batch 16700, loss[loss=0.1664, simple_loss=0.2308, pruned_loss=0.05105, over 21812.00 frames. ], tot_loss[loss=0.2249, simple_loss=0.3076, pruned_loss=0.0711, over 4278510.61 frames. ], batch size: 118, lr: 2.65e-03, grad_scale: 16.0 2023-06-28 00:16:58,622 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1929876.0, ans=0.125 2023-06-28 00:17:32,851 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=1929936.0, ans=0.125 2023-06-28 00:18:29,867 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1930116.0, ans=0.0 2023-06-28 00:18:47,085 INFO [train.py:996] (3/4) Epoch 11, batch 16750, loss[loss=0.2292, simple_loss=0.3399, pruned_loss=0.05926, over 20806.00 frames. ], tot_loss[loss=0.2293, simple_loss=0.3107, pruned_loss=0.07394, over 4276497.69 frames. ], batch size: 607, lr: 2.65e-03, grad_scale: 16.0 2023-06-28 00:18:49,370 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=1930176.0, ans=0.07 2023-06-28 00:19:05,254 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1930176.0, ans=0.1 2023-06-28 00:19:12,035 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.860e+02 6.994e+02 8.979e+02 1.342e+03 3.526e+03, threshold=1.796e+03, percent-clipped=9.0 2023-06-28 00:20:16,576 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1930416.0, ans=0.125 2023-06-28 00:20:37,333 INFO [train.py:996] (3/4) Epoch 11, batch 16800, loss[loss=0.2347, simple_loss=0.3617, pruned_loss=0.05385, over 20738.00 frames. ], tot_loss[loss=0.2318, simple_loss=0.3152, pruned_loss=0.07423, over 4269337.05 frames. ], batch size: 607, lr: 2.65e-03, grad_scale: 32.0 2023-06-28 00:21:17,125 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=1930596.0, ans=0.125 2023-06-28 00:22:01,831 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1930716.0, ans=0.125 2023-06-28 00:22:18,739 INFO [train.py:996] (3/4) Epoch 11, batch 16850, loss[loss=0.2189, simple_loss=0.3311, pruned_loss=0.05334, over 20876.00 frames. ], tot_loss[loss=0.2296, simple_loss=0.3125, pruned_loss=0.07335, over 4273207.67 frames. ], batch size: 607, lr: 2.65e-03, grad_scale: 16.0 2023-06-28 00:22:28,043 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.63 vs. 
limit=15.0 2023-06-28 00:22:36,566 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.15 vs. limit=6.0 2023-06-28 00:22:38,611 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.042e+02 8.354e+02 1.397e+03 2.191e+03 5.653e+03, threshold=2.793e+03, percent-clipped=35.0 2023-06-28 00:22:39,123 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=1930836.0, ans=0.2 2023-06-28 00:24:00,836 INFO [train.py:996] (3/4) Epoch 11, batch 16900, loss[loss=0.1752, simple_loss=0.2495, pruned_loss=0.05045, over 21533.00 frames. ], tot_loss[loss=0.2258, simple_loss=0.3068, pruned_loss=0.07245, over 4277489.96 frames. ], batch size: 230, lr: 2.65e-03, grad_scale: 16.0 2023-06-28 00:24:07,003 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.16 vs. limit=6.0 2023-06-28 00:24:17,518 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1931136.0, ans=0.0 2023-06-28 00:25:41,091 INFO [train.py:996] (3/4) Epoch 11, batch 16950, loss[loss=0.1931, simple_loss=0.2649, pruned_loss=0.06071, over 21427.00 frames. ], tot_loss[loss=0.2201, simple_loss=0.2991, pruned_loss=0.07052, over 4274480.56 frames. ], batch size: 211, lr: 2.65e-03, grad_scale: 16.0 2023-06-28 00:26:00,776 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.390e+02 6.361e+02 9.262e+02 1.143e+03 1.974e+03, threshold=1.852e+03, percent-clipped=0.0 2023-06-28 00:26:01,570 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-28 00:26:26,143 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1931496.0, ans=0.125 2023-06-28 00:26:46,138 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.66 vs. limit=12.0 2023-06-28 00:26:55,379 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1931556.0, ans=0.125 2023-06-28 00:26:55,813 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.40 vs. limit=12.0 2023-06-28 00:27:20,028 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1931616.0, ans=0.0 2023-06-28 00:27:22,694 INFO [train.py:996] (3/4) Epoch 11, batch 17000, loss[loss=0.2252, simple_loss=0.3, pruned_loss=0.0752, over 21849.00 frames. ], tot_loss[loss=0.2179, simple_loss=0.2953, pruned_loss=0.07028, over 4285564.73 frames. ], batch size: 124, lr: 2.64e-03, grad_scale: 16.0 2023-06-28 00:27:33,924 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1931676.0, ans=0.125 2023-06-28 00:28:21,183 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=1931796.0, ans=0.2 2023-06-28 00:28:35,045 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.37 vs. 
limit=12.0 2023-06-28 00:28:55,276 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=1931916.0, ans=0.2 2023-06-28 00:28:57,080 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1931916.0, ans=0.1 2023-06-28 00:28:58,394 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1931916.0, ans=0.0 2023-06-28 00:29:01,574 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=1931916.0, ans=0.2 2023-06-28 00:29:06,153 INFO [train.py:996] (3/4) Epoch 11, batch 17050, loss[loss=0.2235, simple_loss=0.3139, pruned_loss=0.06659, over 21446.00 frames. ], tot_loss[loss=0.2235, simple_loss=0.302, pruned_loss=0.07251, over 4284123.74 frames. ], batch size: 211, lr: 2.64e-03, grad_scale: 16.0 2023-06-28 00:29:13,391 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1931976.0, ans=0.125 2023-06-28 00:29:26,235 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.012e+02 8.433e+02 1.501e+03 2.176e+03 5.028e+03, threshold=3.003e+03, percent-clipped=35.0 2023-06-28 00:29:32,833 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=1932036.0, ans=0.0 2023-06-28 00:30:13,208 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1932156.0, ans=0.1 2023-06-28 00:30:39,009 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=1932216.0, ans=0.2 2023-06-28 00:30:46,880 INFO [train.py:996] (3/4) Epoch 11, batch 17100, loss[loss=0.233, simple_loss=0.3079, pruned_loss=0.079, over 21804.00 frames. ], tot_loss[loss=0.2226, simple_loss=0.2998, pruned_loss=0.07273, over 4289215.13 frames. ], batch size: 112, lr: 2.64e-03, grad_scale: 16.0 2023-06-28 00:31:20,716 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=1932336.0, ans=0.0 2023-06-28 00:31:37,051 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1932396.0, ans=0.125 2023-06-28 00:31:59,989 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=1932456.0, ans=0.0 2023-06-28 00:32:06,122 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=5.72 vs. limit=15.0 2023-06-28 00:32:29,946 INFO [train.py:996] (3/4) Epoch 11, batch 17150, loss[loss=0.1871, simple_loss=0.2677, pruned_loss=0.05323, over 21736.00 frames. ], tot_loss[loss=0.2206, simple_loss=0.2968, pruned_loss=0.07215, over 4290789.38 frames. 
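The scaling.py:962 Whitening entries compare a per-module whiteness metric against a limit, with a corrective penalty applied only once the metric exceeds that limit. One plausible way to quantify how far activations are from having a white (identity-proportional) covariance is sketched below; the exact formula and names are assumptions, chosen so that the value is 1.0 for perfectly white features and grows as the eigenvalue spectrum becomes uneven:

```python
import torch

def whiteness_metric(x: torch.Tensor) -> torch.Tensor:
    """x: [num_frames, num_channels]. Returns ~1.0 when the covariance of x
    is a multiple of the identity, larger values when it is not."""
    x = x - x.mean(dim=0, keepdim=True)
    cov = (x.T @ x) / x.shape[0]                # [C, C] channel covariance
    num_channels = cov.shape[0]
    return num_channels * (cov * cov).sum() / cov.trace() ** 2

feats = torch.randn(10000, 256)                 # approximately white features
print(whiteness_metric(feats))                  # close to 1.0
print(whiteness_metric(feats * torch.linspace(0.1, 3.0, 256)))  # noticeably larger
```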
], batch size: 389, lr: 2.64e-03, grad_scale: 16.0 2023-06-28 00:32:54,500 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.748e+02 5.716e+02 7.652e+02 9.791e+02 2.028e+03, threshold=1.530e+03, percent-clipped=0.0 2023-06-28 00:33:13,138 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1932696.0, ans=0.1 2023-06-28 00:33:14,466 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1932696.0, ans=0.1 2023-06-28 00:33:19,213 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=1932696.0, ans=0.0 2023-06-28 00:33:51,164 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=1932756.0, ans=0.0 2023-06-28 00:34:07,889 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1932816.0, ans=0.1 2023-06-28 00:34:09,506 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1932816.0, ans=0.0 2023-06-28 00:34:16,988 INFO [train.py:996] (3/4) Epoch 11, batch 17200, loss[loss=0.2423, simple_loss=0.3184, pruned_loss=0.0831, over 21273.00 frames. ], tot_loss[loss=0.2201, simple_loss=0.2959, pruned_loss=0.07211, over 4293696.73 frames. ], batch size: 143, lr: 2.64e-03, grad_scale: 32.0 2023-06-28 00:35:01,986 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=1932996.0, ans=0.0 2023-06-28 00:35:25,485 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1933056.0, ans=0.125 2023-06-28 00:35:39,340 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=1933116.0, ans=0.0 2023-06-28 00:35:40,212 INFO [scaling.py:962] (3/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=3.86 vs. limit=5.0 2023-06-28 00:35:59,818 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=1933176.0, ans=10.0 2023-06-28 00:36:00,793 INFO [train.py:996] (3/4) Epoch 11, batch 17250, loss[loss=0.2229, simple_loss=0.3036, pruned_loss=0.07111, over 21895.00 frames. ], tot_loss[loss=0.2226, simple_loss=0.2987, pruned_loss=0.0732, over 4292212.21 frames. ], batch size: 371, lr: 2.64e-03, grad_scale: 16.0 2023-06-28 00:36:20,331 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=1933176.0, ans=0.2 2023-06-28 00:36:32,860 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.884e+02 8.366e+02 1.182e+03 1.787e+03 4.360e+03, threshold=2.365e+03, percent-clipped=31.0 2023-06-28 00:36:57,282 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.84 vs. limit=10.0 2023-06-28 00:36:58,301 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1933296.0, ans=0.1 2023-06-28 00:37:00,832 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=12.87 vs. 
limit=22.5 2023-06-28 00:37:49,418 INFO [train.py:996] (3/4) Epoch 11, batch 17300, loss[loss=0.2553, simple_loss=0.3316, pruned_loss=0.08952, over 21301.00 frames. ], tot_loss[loss=0.23, simple_loss=0.3071, pruned_loss=0.07641, over 4288721.63 frames. ], batch size: 143, lr: 2.64e-03, grad_scale: 16.0 2023-06-28 00:37:53,510 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1933476.0, ans=0.125 2023-06-28 00:38:09,107 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=1933476.0, ans=0.035 2023-06-28 00:38:14,109 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1933536.0, ans=0.0 2023-06-28 00:38:44,729 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1933596.0, ans=0.1 2023-06-28 00:38:58,131 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1933656.0, ans=0.125 2023-06-28 00:39:02,276 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.75 vs. limit=15.0 2023-06-28 00:39:02,477 INFO [scaling.py:962] (3/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=3.97 vs. limit=5.0 2023-06-28 00:39:40,381 INFO [train.py:996] (3/4) Epoch 11, batch 17350, loss[loss=0.1852, simple_loss=0.2693, pruned_loss=0.05051, over 21400.00 frames. ], tot_loss[loss=0.2306, simple_loss=0.3085, pruned_loss=0.07636, over 4282327.88 frames. ], batch size: 211, lr: 2.64e-03, grad_scale: 16.0 2023-06-28 00:40:07,461 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.114e+02 8.299e+02 1.147e+03 1.835e+03 3.555e+03, threshold=2.294e+03, percent-clipped=8.0 2023-06-28 00:40:30,171 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1933896.0, ans=0.1 2023-06-28 00:41:25,715 INFO [train.py:996] (3/4) Epoch 11, batch 17400, loss[loss=0.213, simple_loss=0.2713, pruned_loss=0.07734, over 20145.00 frames. ], tot_loss[loss=0.2256, simple_loss=0.3049, pruned_loss=0.07316, over 4282814.00 frames. ], batch size: 702, lr: 2.64e-03, grad_scale: 16.0 2023-06-28 00:41:39,636 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.32 vs. limit=12.0 2023-06-28 00:43:13,926 INFO [train.py:996] (3/4) Epoch 11, batch 17450, loss[loss=0.1771, simple_loss=0.2746, pruned_loss=0.03974, over 21701.00 frames. ], tot_loss[loss=0.2213, simple_loss=0.3009, pruned_loss=0.07088, over 4268051.23 frames. ], batch size: 298, lr: 2.64e-03, grad_scale: 16.0 2023-06-28 00:43:41,606 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.482e+02 8.576e+02 1.354e+03 2.024e+03 4.305e+03, threshold=2.708e+03, percent-clipped=16.0 2023-06-28 00:44:05,369 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=1934496.0, ans=0.125 2023-06-28 00:44:55,320 INFO [train.py:996] (3/4) Epoch 11, batch 17500, loss[loss=0.2162, simple_loss=0.2953, pruned_loss=0.0686, over 21393.00 frames. ], tot_loss[loss=0.2166, simple_loss=0.297, pruned_loss=0.06816, over 4273608.06 frames. 
], batch size: 131, lr: 2.64e-03, grad_scale: 8.0 2023-06-28 00:44:59,275 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1934676.0, ans=0.0 2023-06-28 00:45:00,762 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=1934676.0, ans=0.125 2023-06-28 00:45:04,424 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=1934676.0, ans=0.2 2023-06-28 00:45:19,960 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1934736.0, ans=0.125 2023-06-28 00:45:20,034 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1934736.0, ans=0.125 2023-06-28 00:45:33,372 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1934796.0, ans=0.0 2023-06-28 00:46:35,448 INFO [train.py:996] (3/4) Epoch 11, batch 17550, loss[loss=0.215, simple_loss=0.31, pruned_loss=0.06001, over 21462.00 frames. ], tot_loss[loss=0.2147, simple_loss=0.296, pruned_loss=0.0667, over 4267484.70 frames. ], batch size: 194, lr: 2.64e-03, grad_scale: 8.0 2023-06-28 00:47:00,422 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1935036.0, ans=0.125 2023-06-28 00:47:02,918 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.317e+02 6.340e+02 7.775e+02 1.102e+03 1.869e+03, threshold=1.555e+03, percent-clipped=0.0 2023-06-28 00:47:16,665 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=1935096.0, ans=0.2 2023-06-28 00:48:16,456 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.15 vs. limit=15.0 2023-06-28 00:48:16,978 INFO [train.py:996] (3/4) Epoch 11, batch 17600, loss[loss=0.2036, simple_loss=0.2838, pruned_loss=0.0617, over 20622.00 frames. ], tot_loss[loss=0.2166, simple_loss=0.2989, pruned_loss=0.06719, over 4259815.22 frames. ], batch size: 607, lr: 2.64e-03, grad_scale: 16.0 2023-06-28 00:48:39,868 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=1935336.0, ans=0.2 2023-06-28 00:49:07,543 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1935396.0, ans=0.125 2023-06-28 00:49:12,688 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1935396.0, ans=0.125 2023-06-28 00:49:24,721 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=1935456.0, ans=0.2 2023-06-28 00:49:37,997 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-28 00:49:54,620 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1935516.0, ans=0.125 2023-06-28 00:50:01,081 INFO [train.py:996] (3/4) Epoch 11, batch 17650, loss[loss=0.1698, simple_loss=0.2384, pruned_loss=0.05064, over 21645.00 frames. 
], tot_loss[loss=0.2162, simple_loss=0.2973, pruned_loss=0.06754, over 4269750.94 frames. ], batch size: 263, lr: 2.64e-03, grad_scale: 16.0 2023-06-28 00:50:01,785 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1935576.0, ans=0.1 2023-06-28 00:50:08,635 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=1935576.0, ans=0.04949747468305833 2023-06-28 00:50:29,621 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.907e+02 7.332e+02 1.084e+03 1.896e+03 3.594e+03, threshold=2.168e+03, percent-clipped=34.0 2023-06-28 00:50:41,071 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.54 vs. limit=15.0 2023-06-28 00:50:42,461 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1935636.0, ans=0.125 2023-06-28 00:50:56,791 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=10.15 vs. limit=15.0 2023-06-28 00:51:32,290 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=1935816.0, ans=0.125 2023-06-28 00:51:49,572 INFO [train.py:996] (3/4) Epoch 11, batch 17700, loss[loss=0.2059, simple_loss=0.299, pruned_loss=0.0564, over 21565.00 frames. ], tot_loss[loss=0.21, simple_loss=0.2901, pruned_loss=0.06493, over 4254280.74 frames. ], batch size: 230, lr: 2.64e-03, grad_scale: 16.0 2023-06-28 00:52:47,629 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1935996.0, ans=0.125 2023-06-28 00:53:33,360 INFO [train.py:996] (3/4) Epoch 11, batch 17750, loss[loss=0.2258, simple_loss=0.309, pruned_loss=0.07131, over 21624.00 frames. ], tot_loss[loss=0.2174, simple_loss=0.2984, pruned_loss=0.06817, over 4262621.77 frames. ], batch size: 263, lr: 2.64e-03, grad_scale: 16.0 2023-06-28 00:54:01,414 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.937e+02 7.178e+02 1.077e+03 1.520e+03 3.336e+03, threshold=2.154e+03, percent-clipped=9.0 2023-06-28 00:54:47,796 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=1936356.0, ans=0.2 2023-06-28 00:55:22,098 INFO [train.py:996] (3/4) Epoch 11, batch 17800, loss[loss=0.2023, simple_loss=0.2838, pruned_loss=0.06041, over 19942.00 frames. ], tot_loss[loss=0.2161, simple_loss=0.2974, pruned_loss=0.06737, over 4264177.79 frames. 
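Each train.py:996 line pairs the loss of the current batch (over its own frame count) with a tot_loss accumulated over a much larger pool of frames. A minimal frame-weighted accumulator in that spirit (a guess at the bookkeeping, not the project's actual tracking code):

```python
class FrameWeightedLoss:
    """Accumulates losses as a frame-weighted average, in the spirit of the
    tot_loss[... over N frames] figures in the log."""
    def __init__(self):
        self.weighted_sum = 0.0
        self.num_frames = 0.0

    def update(self, loss: float, frames: float) -> None:
        self.weighted_sum += loss * frames
        self.num_frames += frames

    @property
    def average(self) -> float:
        return self.weighted_sum / max(self.num_frames, 1.0)

tot = FrameWeightedLoss()
tot.update(0.2059, 21565.0)     # per-batch figures in the style of the log
tot.update(0.2258, 21624.0)
print(f"tot_loss={tot.average:.4f} over {tot.num_frames:.2f} frames")
```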
], batch size: 703, lr: 2.64e-03, grad_scale: 16.0 2023-06-28 00:55:22,844 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=1936476.0, ans=0.125 2023-06-28 00:56:00,210 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1936596.0, ans=0.125 2023-06-28 00:56:34,903 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=1936656.0, ans=0.125 2023-06-28 00:56:49,704 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer_ff2.min_abs, batch_count=1936716.0, ans=0.1 2023-06-28 00:56:54,647 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=1936716.0, ans=0.125 2023-06-28 00:57:05,740 INFO [train.py:996] (3/4) Epoch 11, batch 17850, loss[loss=0.242, simple_loss=0.3236, pruned_loss=0.08023, over 21363.00 frames. ], tot_loss[loss=0.2164, simple_loss=0.2982, pruned_loss=0.06732, over 4266317.99 frames. ], batch size: 549, lr: 2.64e-03, grad_scale: 16.0 2023-06-28 00:57:14,830 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1936776.0, ans=0.125 2023-06-28 00:57:34,255 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.504e+02 7.242e+02 1.057e+03 1.582e+03 3.438e+03, threshold=2.115e+03, percent-clipped=9.0 2023-06-28 00:57:42,872 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=1936836.0, ans=0.0 2023-06-28 00:57:50,333 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.81 vs. limit=15.0 2023-06-28 00:57:57,273 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=12.27 vs. limit=22.5 2023-06-28 00:57:58,745 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.26 vs. limit=15.0 2023-06-28 00:58:40,400 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1937016.0, ans=0.1 2023-06-28 00:58:47,531 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=1937076.0, ans=0.0 2023-06-28 00:58:48,598 INFO [train.py:996] (3/4) Epoch 11, batch 17900, loss[loss=0.2179, simple_loss=0.3011, pruned_loss=0.06732, over 21102.00 frames. ], tot_loss[loss=0.2197, simple_loss=0.302, pruned_loss=0.0687, over 4263012.99 frames. ], batch size: 143, lr: 2.64e-03, grad_scale: 16.0 2023-06-28 00:59:19,505 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.27 vs. 
limit=10.0 2023-06-28 00:59:54,734 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1937196.0, ans=0.1 2023-06-28 01:00:06,638 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-28 01:00:06,727 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer_ff2.min_abs, batch_count=1937256.0, ans=0.1 2023-06-28 01:00:37,333 INFO [train.py:996] (3/4) Epoch 11, batch 17950, loss[loss=0.1668, simple_loss=0.2643, pruned_loss=0.0347, over 21758.00 frames. ], tot_loss[loss=0.2162, simple_loss=0.3009, pruned_loss=0.06576, over 4256174.12 frames. ], batch size: 332, lr: 2.64e-03, grad_scale: 16.0 2023-06-28 01:00:42,780 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1937376.0, ans=0.125 2023-06-28 01:01:09,590 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.263e+02 6.938e+02 9.459e+02 1.364e+03 3.127e+03, threshold=1.892e+03, percent-clipped=7.0 2023-06-28 01:01:44,715 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=2.92 vs. limit=15.0 2023-06-28 01:02:22,722 INFO [train.py:996] (3/4) Epoch 11, batch 18000, loss[loss=0.173, simple_loss=0.2268, pruned_loss=0.05967, over 20670.00 frames. ], tot_loss[loss=0.2109, simple_loss=0.2937, pruned_loss=0.06402, over 4261635.41 frames. ], batch size: 607, lr: 2.64e-03, grad_scale: 32.0 2023-06-28 01:02:22,722 INFO [train.py:1019] (3/4) Computing validation loss 2023-06-28 01:02:39,148 INFO [train.py:1028] (3/4) Epoch 11, validation: loss=0.2572, simple_loss=0.3509, pruned_loss=0.08176, over 1796401.00 frames. 2023-06-28 01:02:39,149 INFO [train.py:1029] (3/4) Maximum memory allocated so far is 23690MB 2023-06-28 01:03:34,808 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1937796.0, ans=0.125 2023-06-28 01:04:22,246 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.21 vs. limit=15.0 2023-06-28 01:04:22,706 INFO [train.py:996] (3/4) Epoch 11, batch 18050, loss[loss=0.1875, simple_loss=0.2561, pruned_loss=0.05948, over 20727.00 frames. ], tot_loss[loss=0.207, simple_loss=0.2877, pruned_loss=0.06314, over 4252266.89 frames. ], batch size: 607, lr: 2.64e-03, grad_scale: 16.0 2023-06-28 01:04:51,544 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1938036.0, ans=0.0 2023-06-28 01:04:58,009 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.600e+02 6.639e+02 9.648e+02 1.453e+03 3.276e+03, threshold=1.930e+03, percent-clipped=8.0 2023-06-28 01:05:06,755 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=1938096.0, ans=0.07 2023-06-28 01:05:51,663 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=1938216.0, ans=0.125 2023-06-28 01:06:00,555 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.11 vs. 
limit=6.0 2023-06-28 01:06:10,712 INFO [train.py:996] (3/4) Epoch 11, batch 18100, loss[loss=0.2318, simple_loss=0.3259, pruned_loss=0.06882, over 21692.00 frames. ], tot_loss[loss=0.2123, simple_loss=0.2929, pruned_loss=0.06582, over 4257945.57 frames. ], batch size: 351, lr: 2.64e-03, grad_scale: 16.0 2023-06-28 01:06:38,433 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=1938336.0, ans=0.2 2023-06-28 01:07:05,108 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-28 01:07:05,167 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1938456.0, ans=0.125 2023-06-28 01:07:48,924 INFO [train.py:996] (3/4) Epoch 11, batch 18150, loss[loss=0.2111, simple_loss=0.2843, pruned_loss=0.06898, over 21763.00 frames. ], tot_loss[loss=0.2128, simple_loss=0.2945, pruned_loss=0.06551, over 4267520.23 frames. ], batch size: 351, lr: 2.64e-03, grad_scale: 16.0 2023-06-28 01:08:18,396 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.434e+02 6.385e+02 9.174e+02 1.252e+03 3.670e+03, threshold=1.835e+03, percent-clipped=3.0 2023-06-28 01:08:37,117 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=11.49 vs. limit=15.0 2023-06-28 01:08:50,246 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.46 vs. limit=22.5 2023-06-28 01:08:51,059 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1938756.0, ans=0.125 2023-06-28 01:09:24,173 INFO [train.py:996] (3/4) Epoch 11, batch 18200, loss[loss=0.1917, simple_loss=0.2615, pruned_loss=0.06096, over 21819.00 frames. ], tot_loss[loss=0.2098, simple_loss=0.2888, pruned_loss=0.06545, over 4273744.31 frames. ], batch size: 98, lr: 2.64e-03, grad_scale: 16.0 2023-06-28 01:09:55,630 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.43 vs. limit=12.0 2023-06-28 01:10:00,098 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=1938936.0, ans=0.0 2023-06-28 01:11:00,648 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.25 vs. limit=6.0 2023-06-28 01:11:04,698 INFO [train.py:996] (3/4) Epoch 11, batch 18250, loss[loss=0.1969, simple_loss=0.2688, pruned_loss=0.06247, over 21796.00 frames. ], tot_loss[loss=0.2038, simple_loss=0.2816, pruned_loss=0.063, over 4275572.29 frames. 
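At batch 18000 above, training pauses to compute a validation loss over a fixed pool of frames and then reports the peak GPU memory allocated so far. The standard PyTorch pattern for both is roughly the following; the model, dataloader, and model.loss() call are placeholders, while torch.no_grad and torch.cuda.max_memory_allocated are real APIs:

```python
import torch

def compute_validation_loss(model, valid_loader, device="cuda"):
    """Frame-weighted validation loss plus peak GPU memory, in the spirit of
    the 'Computing validation loss' / 'Maximum memory allocated' log lines."""
    model.eval()
    loss_sum, frames = 0.0, 0.0
    with torch.no_grad():                       # no gradients during validation
        for batch in valid_loader:
            loss, num_frames = model.loss(batch)   # hypothetical loss API
            loss_sum += loss.item() * num_frames
            frames += num_frames
    model.train()
    peak_mb = torch.cuda.max_memory_allocated(device) // (1024 * 1024)
    return loss_sum / max(frames, 1.0), peak_mb
```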
], batch size: 298, lr: 2.64e-03, grad_scale: 16.0 2023-06-28 01:11:08,556 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1939176.0, ans=0.1 2023-06-28 01:11:26,256 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1939236.0, ans=0.1 2023-06-28 01:11:29,095 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=1939236.0, ans=0.0 2023-06-28 01:11:31,235 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=6.66 vs. limit=15.0 2023-06-28 01:11:37,983 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.242e+02 6.955e+02 1.102e+03 1.552e+03 2.927e+03, threshold=2.205e+03, percent-clipped=10.0 2023-06-28 01:11:52,389 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.70 vs. limit=22.5 2023-06-28 01:12:13,635 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1939356.0, ans=0.1 2023-06-28 01:12:13,658 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1939356.0, ans=0.125 2023-06-28 01:12:33,842 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1939416.0, ans=0.125 2023-06-28 01:12:40,215 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1939416.0, ans=0.1 2023-06-28 01:12:46,286 INFO [train.py:996] (3/4) Epoch 11, batch 18300, loss[loss=0.2497, simple_loss=0.3347, pruned_loss=0.08237, over 21712.00 frames. ], tot_loss[loss=0.2032, simple_loss=0.2799, pruned_loss=0.0632, over 4268491.71 frames. ], batch size: 441, lr: 2.64e-03, grad_scale: 16.0 2023-06-28 01:12:57,237 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer_na.min_abs, batch_count=1939476.0, ans=0.02 2023-06-28 01:12:58,759 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=1939476.0, ans=0.2 2023-06-28 01:13:15,298 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=1939536.0, ans=0.2 2023-06-28 01:13:17,576 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.05 vs. limit=6.0 2023-06-28 01:13:28,709 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1939596.0, ans=0.125 2023-06-28 01:13:40,126 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=1939656.0, ans=0.0 2023-06-28 01:14:22,405 INFO [train.py:996] (3/4) Epoch 11, batch 18350, loss[loss=0.2032, simple_loss=0.2695, pruned_loss=0.06845, over 21118.00 frames. ], tot_loss[loss=0.2077, simple_loss=0.2874, pruned_loss=0.06399, over 4267371.44 frames. 
], batch size: 159, lr: 2.64e-03, grad_scale: 16.0 2023-06-28 01:14:56,373 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.730e+02 6.827e+02 1.100e+03 1.659e+03 4.791e+03, threshold=2.200e+03, percent-clipped=14.0 2023-06-28 01:16:00,673 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1940016.0, ans=0.125 2023-06-28 01:16:05,017 INFO [train.py:996] (3/4) Epoch 11, batch 18400, loss[loss=0.1583, simple_loss=0.2393, pruned_loss=0.03864, over 21532.00 frames. ], tot_loss[loss=0.2045, simple_loss=0.283, pruned_loss=0.06295, over 4252584.08 frames. ], batch size: 195, lr: 2.64e-03, grad_scale: 32.0 2023-06-28 01:16:06,160 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=5.12 vs. limit=15.0 2023-06-28 01:16:28,781 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1940136.0, ans=0.125 2023-06-28 01:16:43,981 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1940196.0, ans=0.125 2023-06-28 01:17:37,794 INFO [train.py:996] (3/4) Epoch 11, batch 18450, loss[loss=0.2098, simple_loss=0.2934, pruned_loss=0.06305, over 21498.00 frames. ], tot_loss[loss=0.1996, simple_loss=0.28, pruned_loss=0.05963, over 4251836.35 frames. ], batch size: 473, lr: 2.64e-03, grad_scale: 16.0 2023-06-28 01:17:51,673 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1940376.0, ans=0.1 2023-06-28 01:18:05,374 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.34 vs. limit=15.0 2023-06-28 01:18:14,202 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.159e+02 6.017e+02 7.931e+02 1.267e+03 3.301e+03, threshold=1.586e+03, percent-clipped=3.0 2023-06-28 01:18:32,060 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.86 vs. limit=15.0 2023-06-28 01:19:07,780 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=1940616.0, ans=0.05 2023-06-28 01:19:14,646 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=1940676.0, ans=0.125 2023-06-28 01:19:15,635 INFO [train.py:996] (3/4) Epoch 11, batch 18500, loss[loss=0.1803, simple_loss=0.2668, pruned_loss=0.04693, over 21241.00 frames. ], tot_loss[loss=0.1975, simple_loss=0.2769, pruned_loss=0.05909, over 4242661.03 frames. ], batch size: 176, lr: 2.64e-03, grad_scale: 16.0 2023-06-28 01:20:06,949 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1940796.0, ans=0.125 2023-06-28 01:20:31,992 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.17 vs. limit=12.0 2023-06-28 01:20:57,748 INFO [train.py:996] (3/4) Epoch 11, batch 18550, loss[loss=0.21, simple_loss=0.2809, pruned_loss=0.06951, over 20080.00 frames. ], tot_loss[loss=0.1968, simple_loss=0.2763, pruned_loss=0.05865, over 4241352.10 frames. 
], batch size: 702, lr: 2.64e-03, grad_scale: 16.0 2023-06-28 01:21:00,482 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.17 vs. limit=15.0 2023-06-28 01:21:22,789 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=1941036.0, ans=0.125 2023-06-28 01:21:34,190 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.234e+02 6.100e+02 9.556e+02 1.452e+03 3.261e+03, threshold=1.911e+03, percent-clipped=19.0 2023-06-28 01:21:40,607 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=6.06 vs. limit=15.0 2023-06-28 01:22:07,932 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1941156.0, ans=0.1 2023-06-28 01:22:19,862 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=1941216.0, ans=0.125 2023-06-28 01:22:26,589 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=1941216.0, ans=0.09899494936611666 2023-06-28 01:22:45,325 INFO [train.py:996] (3/4) Epoch 11, batch 18600, loss[loss=0.1726, simple_loss=0.2479, pruned_loss=0.04869, over 21351.00 frames. ], tot_loss[loss=0.1959, simple_loss=0.2742, pruned_loss=0.05878, over 4242174.82 frames. ], batch size: 159, lr: 2.64e-03, grad_scale: 16.0 2023-06-28 01:24:12,621 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=1941516.0, ans=0.125 2023-06-28 01:24:14,263 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1941516.0, ans=0.125 2023-06-28 01:24:21,653 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=13.54 vs. limit=22.5 2023-06-28 01:24:26,399 INFO [train.py:996] (3/4) Epoch 11, batch 18650, loss[loss=0.1841, simple_loss=0.2529, pruned_loss=0.05767, over 20011.00 frames. ], tot_loss[loss=0.1952, simple_loss=0.2731, pruned_loss=0.05866, over 4229431.84 frames. 
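The grad_scale value in the batch lines (16.0 here, 32.0 on some batches above) is the dynamic loss scale used for fp16 training: it shrinks when gradients overflow and grows back after a run of clean steps. A minimal mixed-precision step with PyTorch's standard GradScaler, using a placeholder model and loss, shows where such a number comes from:

```python
import torch
from torch.cuda.amp import GradScaler, autocast

model = torch.nn.Linear(80, 512).cuda()        # placeholder model (assumes a GPU)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
scaler = GradScaler(init_scale=32.0)           # analogue of grad_scale in the log

def train_step(features, targets):
    optimizer.zero_grad()
    with autocast():                           # fp16 autocast region
        loss = torch.nn.functional.mse_loss(model(features), targets)
    scaler.scale(loss).backward()              # backward on the scaled loss
    scaler.step(optimizer)                     # skipped if grads hit inf/nan
    scaler.update()                            # shrink or grow the scale
    return loss.item(), scaler.get_scale()     # the current dynamic loss scale
```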
], batch size: 703, lr: 2.64e-03, grad_scale: 16.0 2023-06-28 01:24:28,707 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1941576.0, ans=0.125 2023-06-28 01:24:38,887 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=1941576.0, ans=0.2 2023-06-28 01:24:52,413 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.322e+02 7.479e+02 1.141e+03 1.737e+03 3.586e+03, threshold=2.283e+03, percent-clipped=19.0 2023-06-28 01:25:30,808 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1941756.0, ans=0.125 2023-06-28 01:25:35,702 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=1941756.0, ans=0.125 2023-06-28 01:25:50,064 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1941816.0, ans=0.0 2023-06-28 01:25:56,792 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1941876.0, ans=0.125 2023-06-28 01:25:57,768 INFO [train.py:996] (3/4) Epoch 11, batch 18700, loss[loss=0.219, simple_loss=0.2871, pruned_loss=0.07547, over 15503.00 frames. ], tot_loss[loss=0.1965, simple_loss=0.272, pruned_loss=0.06053, over 4228826.26 frames. ], batch size: 60, lr: 2.64e-03, grad_scale: 16.0 2023-06-28 01:26:26,056 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=1941936.0, ans=0.0 2023-06-28 01:26:27,806 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1941936.0, ans=0.1 2023-06-28 01:26:31,695 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=1941936.0, ans=0.0 2023-06-28 01:26:45,847 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.whiten.whitening_limit, batch_count=1941996.0, ans=12.0 2023-06-28 01:27:20,835 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.64 vs. limit=15.0 2023-06-28 01:27:40,758 INFO [train.py:996] (3/4) Epoch 11, batch 18750, loss[loss=0.2012, simple_loss=0.2656, pruned_loss=0.06842, over 21779.00 frames. ], tot_loss[loss=0.2002, simple_loss=0.2749, pruned_loss=0.06277, over 4238030.01 frames. 
], batch size: 247, lr: 2.64e-03, grad_scale: 16.0 2023-06-28 01:27:41,375 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1942176.0, ans=0.0 2023-06-28 01:28:11,253 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=1942236.0, ans=0.125 2023-06-28 01:28:17,081 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.499e+02 6.198e+02 1.010e+03 1.418e+03 2.835e+03, threshold=2.020e+03, percent-clipped=5.0 2023-06-28 01:28:30,880 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=1942296.0, ans=0.0 2023-06-28 01:28:32,345 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1942296.0, ans=0.0 2023-06-28 01:28:51,123 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.11 vs. limit=15.0 2023-06-28 01:29:00,647 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=1942356.0, ans=0.2 2023-06-28 01:29:10,781 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=8.03 vs. limit=15.0 2023-06-28 01:29:23,218 INFO [train.py:996] (3/4) Epoch 11, batch 18800, loss[loss=0.2059, simple_loss=0.2991, pruned_loss=0.05641, over 21842.00 frames. ], tot_loss[loss=0.2041, simple_loss=0.2809, pruned_loss=0.06359, over 4249019.21 frames. ], batch size: 316, lr: 2.64e-03, grad_scale: 32.0 2023-06-28 01:30:15,992 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-28 01:30:53,274 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=1942716.0, ans=0.125 2023-06-28 01:30:59,878 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=1942716.0, ans=0.125 2023-06-28 01:31:04,465 INFO [train.py:996] (3/4) Epoch 11, batch 18850, loss[loss=0.1829, simple_loss=0.2473, pruned_loss=0.0593, over 21235.00 frames. ], tot_loss[loss=0.1992, simple_loss=0.2776, pruned_loss=0.06041, over 4238589.23 frames. ], batch size: 159, lr: 2.64e-03, grad_scale: 16.0 2023-06-28 01:31:24,659 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=15.21 vs. limit=22.5 2023-06-28 01:31:34,720 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=11.05 vs. limit=15.0 2023-06-28 01:31:41,970 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.145e+02 6.934e+02 1.004e+03 1.636e+03 4.618e+03, threshold=2.007e+03, percent-clipped=13.0 2023-06-28 01:31:50,713 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=1942896.0, ans=0.025 2023-06-28 01:32:46,418 INFO [train.py:996] (3/4) Epoch 11, batch 18900, loss[loss=0.1847, simple_loss=0.2386, pruned_loss=0.06545, over 20979.00 frames. ], tot_loss[loss=0.1966, simple_loss=0.2732, pruned_loss=0.06001, over 4245442.42 frames. 
], batch size: 608, lr: 2.64e-03, grad_scale: 16.0 2023-06-28 01:33:09,028 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1943076.0, ans=0.125 2023-06-28 01:33:19,032 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=1943136.0, ans=0.2 2023-06-28 01:33:25,241 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=1943136.0, ans=0.2 2023-06-28 01:33:26,789 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=1943136.0, ans=0.125 2023-06-28 01:34:28,564 INFO [train.py:996] (3/4) Epoch 11, batch 18950, loss[loss=0.1989, simple_loss=0.2688, pruned_loss=0.0645, over 21641.00 frames. ], tot_loss[loss=0.1986, simple_loss=0.2745, pruned_loss=0.06134, over 4262644.15 frames. ], batch size: 263, lr: 2.64e-03, grad_scale: 16.0 2023-06-28 01:35:07,393 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.266e+02 7.363e+02 1.116e+03 1.715e+03 3.795e+03, threshold=2.232e+03, percent-clipped=17.0 2023-06-28 01:35:49,394 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1943556.0, ans=0.125 2023-06-28 01:35:59,245 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1943616.0, ans=0.1 2023-06-28 01:36:09,551 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.49 vs. limit=22.5 2023-06-28 01:36:16,456 INFO [train.py:996] (3/4) Epoch 11, batch 19000, loss[loss=0.2222, simple_loss=0.2995, pruned_loss=0.07243, over 21598.00 frames. ], tot_loss[loss=0.2063, simple_loss=0.2848, pruned_loss=0.06386, over 4270804.61 frames. ], batch size: 263, lr: 2.64e-03, grad_scale: 16.0 2023-06-28 01:36:34,875 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1943676.0, ans=0.125 2023-06-28 01:37:10,314 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1943796.0, ans=0.125 2023-06-28 01:37:59,389 INFO [train.py:996] (3/4) Epoch 11, batch 19050, loss[loss=0.212, simple_loss=0.2838, pruned_loss=0.07008, over 21654.00 frames. ], tot_loss[loss=0.2114, simple_loss=0.2888, pruned_loss=0.06705, over 4281236.47 frames. ], batch size: 263, lr: 2.64e-03, grad_scale: 16.0 2023-06-28 01:38:16,676 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1943976.0, ans=0.1 2023-06-28 01:38:34,342 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.763e+02 7.359e+02 1.013e+03 1.496e+03 3.084e+03, threshold=2.026e+03, percent-clipped=8.0 2023-06-28 01:38:48,276 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1944096.0, ans=0.125 2023-06-28 01:39:43,726 INFO [train.py:996] (3/4) Epoch 11, batch 19100, loss[loss=0.236, simple_loss=0.2821, pruned_loss=0.09495, over 21407.00 frames. ], tot_loss[loss=0.2124, simple_loss=0.2877, pruned_loss=0.06856, over 4286076.50 frames. 
], batch size: 509, lr: 2.64e-03, grad_scale: 16.0 2023-06-28 01:41:33,374 INFO [train.py:996] (3/4) Epoch 11, batch 19150, loss[loss=0.2446, simple_loss=0.3427, pruned_loss=0.07324, over 21155.00 frames. ], tot_loss[loss=0.2134, simple_loss=0.289, pruned_loss=0.06892, over 4288162.54 frames. ], batch size: 548, lr: 2.64e-03, grad_scale: 16.0 2023-06-28 01:41:55,591 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.20 vs. limit=12.0 2023-06-28 01:42:03,490 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=1944636.0, ans=0.125 2023-06-28 01:42:07,458 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.03 vs. limit=10.0 2023-06-28 01:42:09,609 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.832e+02 7.660e+02 1.202e+03 2.015e+03 4.043e+03, threshold=2.404e+03, percent-clipped=23.0 2023-06-28 01:42:14,312 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1944696.0, ans=0.0 2023-06-28 01:42:42,553 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1944756.0, ans=0.0 2023-06-28 01:43:19,387 INFO [train.py:996] (3/4) Epoch 11, batch 19200, loss[loss=0.1529, simple_loss=0.2327, pruned_loss=0.03649, over 16327.00 frames. ], tot_loss[loss=0.2178, simple_loss=0.2971, pruned_loss=0.06924, over 4276883.22 frames. ], batch size: 61, lr: 2.64e-03, grad_scale: 16.0 2023-06-28 01:43:55,560 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1944936.0, ans=0.125 2023-06-28 01:44:15,757 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.36 vs. limit=15.0 2023-06-28 01:44:31,257 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=1945056.0, ans=0.125 2023-06-28 01:44:31,303 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=1945056.0, ans=0.0 2023-06-28 01:44:35,966 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1945056.0, ans=0.1 2023-06-28 01:45:01,779 INFO [train.py:996] (3/4) Epoch 11, batch 19250, loss[loss=0.1607, simple_loss=0.2601, pruned_loss=0.03059, over 21647.00 frames. ], tot_loss[loss=0.2147, simple_loss=0.2991, pruned_loss=0.06513, over 4269249.68 frames. ], batch size: 263, lr: 2.64e-03, grad_scale: 16.0 2023-06-28 01:45:30,959 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.20 vs. 
limit=6.0 2023-06-28 01:45:36,165 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.112e+02 6.434e+02 9.084e+02 1.292e+03 2.942e+03, threshold=1.817e+03, percent-clipped=2.0 2023-06-28 01:46:09,131 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=1945356.0, ans=0.125 2023-06-28 01:46:27,185 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-28 01:46:28,765 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-28 01:46:43,104 INFO [train.py:996] (3/4) Epoch 11, batch 19300, loss[loss=0.2037, simple_loss=0.2836, pruned_loss=0.06188, over 21715.00 frames. ], tot_loss[loss=0.2122, simple_loss=0.2956, pruned_loss=0.06438, over 4281450.75 frames. ], batch size: 389, lr: 2.64e-03, grad_scale: 16.0 2023-06-28 01:47:25,240 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1945596.0, ans=0.0 2023-06-28 01:47:41,306 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1945596.0, ans=0.125 2023-06-28 01:47:48,388 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1945656.0, ans=0.1 2023-06-28 01:48:20,162 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=1945716.0, ans=0.0 2023-06-28 01:48:21,517 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1945716.0, ans=0.125 2023-06-28 01:48:25,857 INFO [train.py:996] (3/4) Epoch 11, batch 19350, loss[loss=0.2234, simple_loss=0.2926, pruned_loss=0.07704, over 21345.00 frames. ], tot_loss[loss=0.2058, simple_loss=0.2899, pruned_loss=0.06083, over 4284224.94 frames. ], batch size: 176, lr: 2.64e-03, grad_scale: 16.0 2023-06-28 01:48:58,173 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.16 vs. limit=15.0 2023-06-28 01:49:06,754 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.646e+02 6.544e+02 1.045e+03 1.616e+03 2.621e+03, threshold=2.089e+03, percent-clipped=15.0 2023-06-28 01:49:12,492 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=1945896.0, ans=0.125 2023-06-28 01:49:47,736 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1945956.0, ans=0.1 2023-06-28 01:50:01,145 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1946016.0, ans=0.125 2023-06-28 01:50:06,769 INFO [train.py:996] (3/4) Epoch 11, batch 19400, loss[loss=0.2595, simple_loss=0.3651, pruned_loss=0.07692, over 19757.00 frames. ], tot_loss[loss=0.2044, simple_loss=0.2881, pruned_loss=0.06039, over 4283038.34 frames. 
], batch size: 703, lr: 2.64e-03, grad_scale: 8.0 2023-06-28 01:50:14,375 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1946076.0, ans=0.125 2023-06-28 01:50:25,519 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1946136.0, ans=0.0 2023-06-28 01:50:46,641 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-28 01:51:48,630 INFO [train.py:996] (3/4) Epoch 11, batch 19450, loss[loss=0.1909, simple_loss=0.2545, pruned_loss=0.06366, over 21595.00 frames. ], tot_loss[loss=0.2043, simple_loss=0.2854, pruned_loss=0.06157, over 4286624.43 frames. ], batch size: 414, lr: 2.63e-03, grad_scale: 8.0 2023-06-28 01:52:19,693 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=1946436.0, ans=0.0 2023-06-28 01:52:29,855 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.41 vs. limit=15.0 2023-06-28 01:52:30,296 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.692e+02 7.227e+02 1.148e+03 1.482e+03 2.916e+03, threshold=2.296e+03, percent-clipped=8.0 2023-06-28 01:52:44,038 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=1946496.0, ans=0.2 2023-06-28 01:53:32,668 INFO [train.py:996] (3/4) Epoch 11, batch 19500, loss[loss=0.1957, simple_loss=0.277, pruned_loss=0.0572, over 21801.00 frames. ], tot_loss[loss=0.203, simple_loss=0.2809, pruned_loss=0.06259, over 4277885.72 frames. ], batch size: 372, lr: 2.63e-03, grad_scale: 8.0 2023-06-28 01:53:36,696 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1946676.0, ans=0.125 2023-06-28 01:53:48,943 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=1946676.0, ans=0.125 2023-06-28 01:54:47,227 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1946856.0, ans=0.125 2023-06-28 01:55:12,002 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1946916.0, ans=0.125 2023-06-28 01:55:16,429 INFO [train.py:996] (3/4) Epoch 11, batch 19550, loss[loss=0.1413, simple_loss=0.2081, pruned_loss=0.03728, over 21223.00 frames. ], tot_loss[loss=0.2, simple_loss=0.2764, pruned_loss=0.06182, over 4272312.22 frames. ], batch size: 131, lr: 2.63e-03, grad_scale: 8.0 2023-06-28 01:55:57,085 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.189e+02 6.305e+02 9.070e+02 1.284e+03 2.823e+03, threshold=1.814e+03, percent-clipped=4.0 2023-06-28 01:56:05,349 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1947096.0, ans=0.0 2023-06-28 01:56:33,719 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1947156.0, ans=0.125 2023-06-28 01:56:57,983 INFO [train.py:996] (3/4) Epoch 11, batch 19600, loss[loss=0.1831, simple_loss=0.2498, pruned_loss=0.05817, over 21191.00 frames. ], tot_loss[loss=0.2014, simple_loss=0.278, pruned_loss=0.06237, over 4279586.36 frames. 
], batch size: 608, lr: 2.63e-03, grad_scale: 16.0 2023-06-28 01:58:43,055 INFO [train.py:996] (3/4) Epoch 11, batch 19650, loss[loss=0.2024, simple_loss=0.275, pruned_loss=0.06487, over 21863.00 frames. ], tot_loss[loss=0.2068, simple_loss=0.2826, pruned_loss=0.06551, over 4285161.09 frames. ], batch size: 282, lr: 2.63e-03, grad_scale: 16.0 2023-06-28 01:59:29,711 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.082e+02 7.409e+02 1.104e+03 1.587e+03 3.520e+03, threshold=2.207e+03, percent-clipped=14.0 2023-06-28 02:00:14,934 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1947816.0, ans=0.125 2023-06-28 02:00:39,362 INFO [train.py:996] (3/4) Epoch 11, batch 19700, loss[loss=0.2571, simple_loss=0.3418, pruned_loss=0.08625, over 21490.00 frames. ], tot_loss[loss=0.2098, simple_loss=0.2866, pruned_loss=0.06647, over 4281210.63 frames. ], batch size: 508, lr: 2.63e-03, grad_scale: 16.0 2023-06-28 02:00:58,804 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=1947876.0, ans=0.07 2023-06-28 02:01:14,103 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1947936.0, ans=0.125 2023-06-28 02:01:20,548 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1947936.0, ans=0.125 2023-06-28 02:01:29,561 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=1947996.0, ans=0.2 2023-06-28 02:01:41,498 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1948056.0, ans=0.0 2023-06-28 02:02:06,092 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1948116.0, ans=0.0 2023-06-28 02:02:28,056 INFO [train.py:996] (3/4) Epoch 11, batch 19750, loss[loss=0.2266, simple_loss=0.3016, pruned_loss=0.07577, over 21270.00 frames. ], tot_loss[loss=0.2165, simple_loss=0.297, pruned_loss=0.06796, over 4283701.53 frames. ], batch size: 143, lr: 2.63e-03, grad_scale: 16.0 2023-06-28 02:02:30,379 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1948176.0, ans=0.125 2023-06-28 02:03:04,765 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.869e+02 8.060e+02 1.121e+03 1.722e+03 5.088e+03, threshold=2.243e+03, percent-clipped=14.0 2023-06-28 02:03:13,994 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1948296.0, ans=0.125 2023-06-28 02:03:19,216 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1948296.0, ans=0.0 2023-06-28 02:04:10,943 INFO [train.py:996] (3/4) Epoch 11, batch 19800, loss[loss=0.1787, simple_loss=0.258, pruned_loss=0.04968, over 21798.00 frames. ], tot_loss[loss=0.2175, simple_loss=0.2977, pruned_loss=0.06867, over 4284893.15 frames. 
], batch size: 282, lr: 2.63e-03, grad_scale: 16.0 2023-06-28 02:04:56,529 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1948596.0, ans=0.1 2023-06-28 02:05:08,199 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=1948596.0, ans=0.05 2023-06-28 02:05:49,550 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1948716.0, ans=0.125 2023-06-28 02:06:00,833 INFO [train.py:996] (3/4) Epoch 11, batch 19850, loss[loss=0.1331, simple_loss=0.1913, pruned_loss=0.03747, over 16679.00 frames. ], tot_loss[loss=0.2089, simple_loss=0.2898, pruned_loss=0.06401, over 4273711.00 frames. ], batch size: 60, lr: 2.63e-03, grad_scale: 16.0 2023-06-28 02:06:09,684 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=1948776.0, ans=0.0 2023-06-28 02:06:32,734 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.738e+02 8.105e+02 1.255e+03 1.783e+03 2.882e+03, threshold=2.510e+03, percent-clipped=10.0 2023-06-28 02:06:49,619 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1948896.0, ans=0.125 2023-06-28 02:07:18,567 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=1948956.0, ans=0.2 2023-06-28 02:07:23,643 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=1949016.0, ans=0.0 2023-06-28 02:07:40,425 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.32 vs. limit=22.5 2023-06-28 02:07:42,429 INFO [train.py:996] (3/4) Epoch 11, batch 19900, loss[loss=0.1961, simple_loss=0.2912, pruned_loss=0.05048, over 21796.00 frames. ], tot_loss[loss=0.2057, simple_loss=0.2887, pruned_loss=0.06135, over 4283159.26 frames. ], batch size: 282, lr: 2.63e-03, grad_scale: 16.0 2023-06-28 02:08:16,564 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten.whitening_limit, batch_count=1949196.0, ans=15.0 2023-06-28 02:09:25,679 INFO [train.py:996] (3/4) Epoch 11, batch 19950, loss[loss=0.1647, simple_loss=0.2262, pruned_loss=0.05165, over 20714.00 frames. ], tot_loss[loss=0.2029, simple_loss=0.2835, pruned_loss=0.06111, over 4270914.47 frames. ], batch size: 607, lr: 2.63e-03, grad_scale: 16.0 2023-06-28 02:09:47,537 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=18.14 vs. limit=22.5 2023-06-28 02:09:58,015 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.624e+02 6.449e+02 8.969e+02 1.295e+03 2.845e+03, threshold=1.794e+03, percent-clipped=2.0 2023-06-28 02:10:07,228 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.19 vs. 
limit=15.0 2023-06-28 02:10:09,947 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=1949496.0, ans=0.2 2023-06-28 02:10:11,596 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1949496.0, ans=0.0 2023-06-28 02:11:07,740 INFO [train.py:996] (3/4) Epoch 11, batch 20000, loss[loss=0.2257, simple_loss=0.3051, pruned_loss=0.07312, over 21778.00 frames. ], tot_loss[loss=0.2047, simple_loss=0.2855, pruned_loss=0.06197, over 4281172.88 frames. ], batch size: 112, lr: 2.63e-03, grad_scale: 32.0 2023-06-28 02:12:05,099 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1949796.0, ans=0.1 2023-06-28 02:12:49,283 INFO [train.py:996] (3/4) Epoch 11, batch 20050, loss[loss=0.2094, simple_loss=0.2834, pruned_loss=0.06773, over 21812.00 frames. ], tot_loss[loss=0.2067, simple_loss=0.2865, pruned_loss=0.0634, over 4284079.75 frames. ], batch size: 298, lr: 2.63e-03, grad_scale: 16.0 2023-06-28 02:13:27,884 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.048e+02 6.719e+02 1.022e+03 1.464e+03 2.848e+03, threshold=2.043e+03, percent-clipped=12.0 2023-06-28 02:13:50,085 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=1950156.0, ans=0.125 2023-06-28 02:14:33,042 INFO [train.py:996] (3/4) Epoch 11, batch 20100, loss[loss=0.2128, simple_loss=0.312, pruned_loss=0.05687, over 21852.00 frames. ], tot_loss[loss=0.2102, simple_loss=0.2888, pruned_loss=0.06577, over 4285696.78 frames. ], batch size: 332, lr: 2.63e-03, grad_scale: 16.0 2023-06-28 02:14:35,427 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1950276.0, ans=0.0 2023-06-28 02:14:49,679 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.14 vs. limit=22.5 2023-06-28 02:15:17,291 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=1950396.0, ans=10.0 2023-06-28 02:16:16,908 INFO [train.py:996] (3/4) Epoch 11, batch 20150, loss[loss=0.2634, simple_loss=0.3283, pruned_loss=0.09924, over 21335.00 frames. ], tot_loss[loss=0.219, simple_loss=0.2988, pruned_loss=0.06963, over 4282305.39 frames. ], batch size: 176, lr: 2.63e-03, grad_scale: 16.0 2023-06-28 02:16:43,503 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=1950636.0, ans=0.2 2023-06-28 02:17:06,277 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.168e+02 7.352e+02 1.035e+03 1.689e+03 3.687e+03, threshold=2.071e+03, percent-clipped=15.0 2023-06-28 02:17:25,957 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1950696.0, ans=0.125 2023-06-28 02:17:52,190 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1950816.0, ans=0.125 2023-06-28 02:18:07,648 INFO [train.py:996] (3/4) Epoch 11, batch 20200, loss[loss=0.2034, simple_loss=0.2917, pruned_loss=0.05756, over 21277.00 frames. ], tot_loss[loss=0.2243, simple_loss=0.3044, pruned_loss=0.07208, over 4280702.96 frames. 
], batch size: 176, lr: 2.63e-03, grad_scale: 16.0 2023-06-28 02:18:41,608 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=1950936.0, ans=0.0 2023-06-28 02:18:56,626 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1950996.0, ans=0.0 2023-06-28 02:19:01,265 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1950996.0, ans=0.125 2023-06-28 02:19:23,104 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=1951056.0, ans=0.125 2023-06-28 02:19:51,073 INFO [train.py:996] (3/4) Epoch 11, batch 20250, loss[loss=0.1675, simple_loss=0.2719, pruned_loss=0.0315, over 19694.00 frames. ], tot_loss[loss=0.223, simple_loss=0.3049, pruned_loss=0.07053, over 4275557.35 frames. ], batch size: 702, lr: 2.63e-03, grad_scale: 16.0 2023-06-28 02:20:33,613 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1951236.0, ans=0.1 2023-06-28 02:20:39,492 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.229e+02 6.229e+02 9.670e+02 1.265e+03 2.835e+03, threshold=1.934e+03, percent-clipped=7.0 2023-06-28 02:21:00,222 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=1951356.0, ans=0.0 2023-06-28 02:21:37,850 INFO [train.py:996] (3/4) Epoch 11, batch 20300, loss[loss=0.2121, simple_loss=0.3202, pruned_loss=0.05197, over 20853.00 frames. ], tot_loss[loss=0.2202, simple_loss=0.3035, pruned_loss=0.06848, over 4266158.48 frames. ], batch size: 608, lr: 2.63e-03, grad_scale: 16.0 2023-06-28 02:22:20,658 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=1951596.0, ans=0.0 2023-06-28 02:22:22,596 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=1951596.0, ans=0.0 2023-06-28 02:22:25,913 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1951596.0, ans=0.0 2023-06-28 02:22:41,945 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=1951656.0, ans=0.2 2023-06-28 02:22:45,142 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=1951656.0, ans=0.0 2023-06-28 02:23:09,062 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1951716.0, ans=0.0 2023-06-28 02:23:13,330 INFO [train.py:996] (3/4) Epoch 11, batch 20350, loss[loss=0.1855, simple_loss=0.2546, pruned_loss=0.05823, over 20015.00 frames. ], tot_loss[loss=0.2205, simple_loss=0.3034, pruned_loss=0.06874, over 4258894.53 frames. 
], batch size: 703, lr: 2.63e-03, grad_scale: 16.0 2023-06-28 02:23:17,202 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=1951776.0, ans=0.2 2023-06-28 02:23:55,085 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=1951896.0, ans=0.0 2023-06-28 02:24:01,021 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.709e+02 6.451e+02 8.868e+02 1.412e+03 2.811e+03, threshold=1.774e+03, percent-clipped=7.0 2023-06-28 02:24:01,753 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1951896.0, ans=0.125 2023-06-28 02:24:13,509 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-28 02:24:33,809 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1952016.0, ans=0.125 2023-06-28 02:24:56,168 INFO [train.py:996] (3/4) Epoch 11, batch 20400, loss[loss=0.2265, simple_loss=0.3043, pruned_loss=0.07433, over 21670.00 frames. ], tot_loss[loss=0.2239, simple_loss=0.3054, pruned_loss=0.07119, over 4256416.69 frames. ], batch size: 263, lr: 2.63e-03, grad_scale: 32.0 2023-06-28 02:25:52,165 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1952196.0, ans=0.0 2023-06-28 02:26:24,922 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.55 vs. limit=22.5 2023-06-28 02:26:34,036 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1952316.0, ans=0.125 2023-06-28 02:26:37,003 INFO [train.py:996] (3/4) Epoch 11, batch 20450, loss[loss=0.2573, simple_loss=0.3181, pruned_loss=0.09826, over 21552.00 frames. ], tot_loss[loss=0.2249, simple_loss=0.3052, pruned_loss=0.07232, over 4242522.00 frames. ], batch size: 471, lr: 2.63e-03, grad_scale: 16.0 2023-06-28 02:27:00,154 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1952376.0, ans=0.1 2023-06-28 02:27:25,094 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.694e+02 8.138e+02 1.140e+03 1.534e+03 2.680e+03, threshold=2.280e+03, percent-clipped=12.0 2023-06-28 02:27:35,672 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1952496.0, ans=0.1 2023-06-28 02:27:37,292 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1952496.0, ans=0.125 2023-06-28 02:27:40,507 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1952556.0, ans=0.1 2023-06-28 02:28:03,680 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=1952616.0, ans=0.0 2023-06-28 02:28:06,831 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1952616.0, ans=0.125 2023-06-28 02:28:17,756 INFO [train.py:996] (3/4) Epoch 11, batch 20500, loss[loss=0.2073, simple_loss=0.2783, pruned_loss=0.06811, over 21843.00 frames. 
], tot_loss[loss=0.2228, simple_loss=0.3005, pruned_loss=0.07252, over 4246350.72 frames. ], batch size: 107, lr: 2.63e-03, grad_scale: 16.0 2023-06-28 02:28:46,821 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=1952736.0, ans=0.2 2023-06-28 02:28:52,264 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.47 vs. limit=15.0 2023-06-28 02:29:02,808 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=1952796.0, ans=0.2 2023-06-28 02:29:49,804 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=1952916.0, ans=0.07 2023-06-28 02:30:04,137 INFO [train.py:996] (3/4) Epoch 11, batch 20550, loss[loss=0.2407, simple_loss=0.3259, pruned_loss=0.07773, over 21565.00 frames. ], tot_loss[loss=0.2169, simple_loss=0.2926, pruned_loss=0.07059, over 4244114.91 frames. ], batch size: 441, lr: 2.63e-03, grad_scale: 16.0 2023-06-28 02:30:04,743 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=1952976.0, ans=0.125 2023-06-28 02:30:20,020 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=1952976.0, ans=0.125 2023-06-28 02:30:38,113 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=1953036.0, ans=0.04949747468305833 2023-06-28 02:30:49,267 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.859e+02 7.218e+02 1.038e+03 1.367e+03 4.804e+03, threshold=2.077e+03, percent-clipped=4.0 2023-06-28 02:31:21,519 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1953156.0, ans=0.0 2023-06-28 02:31:38,189 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1953216.0, ans=0.125 2023-06-28 02:31:42,647 INFO [train.py:996] (3/4) Epoch 11, batch 20600, loss[loss=0.219, simple_loss=0.2834, pruned_loss=0.07729, over 21526.00 frames. ], tot_loss[loss=0.2164, simple_loss=0.2955, pruned_loss=0.06864, over 4242797.99 frames. ], batch size: 211, lr: 2.63e-03, grad_scale: 16.0 2023-06-28 02:31:57,548 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=12.56 vs. limit=22.5 2023-06-28 02:33:05,889 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=1953516.0, ans=0.125 2023-06-28 02:33:25,887 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1953516.0, ans=0.1 2023-06-28 02:33:28,455 INFO [train.py:996] (3/4) Epoch 11, batch 20650, loss[loss=0.1925, simple_loss=0.2669, pruned_loss=0.05904, over 21738.00 frames. ], tot_loss[loss=0.2152, simple_loss=0.2925, pruned_loss=0.069, over 4252271.16 frames. ], batch size: 316, lr: 2.63e-03, grad_scale: 16.0 2023-06-28 02:33:37,715 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.29 vs. 
limit=15.0 2023-06-28 02:34:13,039 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.450e+02 6.069e+02 8.420e+02 1.112e+03 2.688e+03, threshold=1.684e+03, percent-clipped=4.0 2023-06-28 02:34:25,593 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1953696.0, ans=0.1 2023-06-28 02:34:51,002 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1953816.0, ans=0.0 2023-06-28 02:35:11,580 INFO [train.py:996] (3/4) Epoch 11, batch 20700, loss[loss=0.1962, simple_loss=0.2874, pruned_loss=0.05251, over 21774.00 frames. ], tot_loss[loss=0.2084, simple_loss=0.2849, pruned_loss=0.06591, over 4256822.20 frames. ], batch size: 351, lr: 2.63e-03, grad_scale: 16.0 2023-06-28 02:35:35,261 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1953876.0, ans=0.125 2023-06-28 02:35:41,746 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=1953936.0, ans=0.2 2023-06-28 02:36:51,499 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=1954116.0, ans=0.2 2023-06-28 02:37:05,864 INFO [train.py:996] (3/4) Epoch 11, batch 20750, loss[loss=0.2545, simple_loss=0.3432, pruned_loss=0.08293, over 21522.00 frames. ], tot_loss[loss=0.2098, simple_loss=0.2881, pruned_loss=0.06577, over 4257014.31 frames. ], batch size: 471, lr: 2.63e-03, grad_scale: 16.0 2023-06-28 02:37:10,215 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.33 vs. limit=15.0 2023-06-28 02:37:15,240 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1954176.0, ans=0.0 2023-06-28 02:37:46,762 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.132e+02 6.969e+02 1.049e+03 1.420e+03 3.386e+03, threshold=2.099e+03, percent-clipped=18.0 2023-06-28 02:38:00,162 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=1954356.0, ans=0.125 2023-06-28 02:38:48,452 INFO [train.py:996] (3/4) Epoch 11, batch 20800, loss[loss=0.2023, simple_loss=0.2668, pruned_loss=0.06892, over 21188.00 frames. ], tot_loss[loss=0.2125, simple_loss=0.2915, pruned_loss=0.06671, over 4264102.81 frames. ], batch size: 176, lr: 2.63e-03, grad_scale: 32.0 2023-06-28 02:38:53,991 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=1954476.0, ans=0.0 2023-06-28 02:38:54,086 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1954476.0, ans=0.125 2023-06-28 02:40:29,255 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1954776.0, ans=0.125 2023-06-28 02:40:30,174 INFO [train.py:996] (3/4) Epoch 11, batch 20850, loss[loss=0.1882, simple_loss=0.2635, pruned_loss=0.05646, over 21401.00 frames. ], tot_loss[loss=0.206, simple_loss=0.2835, pruned_loss=0.06427, over 4260980.69 frames. 
], batch size: 194, lr: 2.63e-03, grad_scale: 16.0 2023-06-28 02:41:11,765 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.093e+02 6.813e+02 9.986e+02 1.626e+03 4.926e+03, threshold=1.997e+03, percent-clipped=17.0 2023-06-28 02:41:12,479 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1954896.0, ans=0.125 2023-06-28 02:41:17,650 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=1954896.0, ans=0.2 2023-06-28 02:42:12,911 INFO [train.py:996] (3/4) Epoch 11, batch 20900, loss[loss=0.2169, simple_loss=0.3018, pruned_loss=0.06603, over 21875.00 frames. ], tot_loss[loss=0.2073, simple_loss=0.284, pruned_loss=0.06523, over 4265268.10 frames. ], batch size: 107, lr: 2.63e-03, grad_scale: 16.0 2023-06-28 02:42:13,506 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=1955076.0, ans=0.0 2023-06-28 02:42:27,996 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1955136.0, ans=0.125 2023-06-28 02:42:31,066 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=1955136.0, ans=0.0 2023-06-28 02:42:33,224 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.40 vs. limit=10.0 2023-06-28 02:42:57,512 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=1955196.0, ans=0.025 2023-06-28 02:43:44,375 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=1955316.0, ans=0.0 2023-06-28 02:43:46,459 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=7.52 vs. limit=15.0 2023-06-28 02:43:46,919 INFO [train.py:996] (3/4) Epoch 11, batch 20950, loss[loss=0.1791, simple_loss=0.26, pruned_loss=0.04905, over 21288.00 frames. ], tot_loss[loss=0.2039, simple_loss=0.2816, pruned_loss=0.06307, over 4265372.47 frames. ], batch size: 548, lr: 2.63e-03, grad_scale: 16.0 2023-06-28 02:43:54,986 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=1955376.0, ans=0.0 2023-06-28 02:44:09,998 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1955436.0, ans=0.2 2023-06-28 02:44:26,758 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.631e+02 6.600e+02 1.009e+03 1.481e+03 3.746e+03, threshold=2.018e+03, percent-clipped=8.0 2023-06-28 02:44:29,310 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.35 vs. limit=12.0 2023-06-28 02:44:32,040 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=1955496.0, ans=0.0 2023-06-28 02:44:43,574 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1955556.0, ans=0.0 2023-06-28 02:45:25,834 INFO [train.py:996] (3/4) Epoch 11, batch 21000, loss[loss=0.2077, simple_loss=0.2843, pruned_loss=0.06556, over 21805.00 frames. 
], tot_loss[loss=0.2044, simple_loss=0.2818, pruned_loss=0.06349, over 4253790.58 frames. ], batch size: 112, lr: 2.63e-03, grad_scale: 16.0 2023-06-28 02:45:25,835 INFO [train.py:1019] (3/4) Computing validation loss 2023-06-28 02:45:45,780 INFO [train.py:1028] (3/4) Epoch 11, validation: loss=0.2661, simple_loss=0.3574, pruned_loss=0.08743, over 1796401.00 frames. 2023-06-28 02:45:45,781 INFO [train.py:1029] (3/4) Maximum memory allocated so far is 23690MB 2023-06-28 02:45:48,537 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1955676.0, ans=0.125 2023-06-28 02:47:10,944 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.67 vs. limit=22.5 2023-06-28 02:47:22,980 INFO [train.py:996] (3/4) Epoch 11, batch 21050, loss[loss=0.1851, simple_loss=0.2427, pruned_loss=0.0638, over 21275.00 frames. ], tot_loss[loss=0.2029, simple_loss=0.2792, pruned_loss=0.06329, over 4261611.86 frames. ], batch size: 548, lr: 2.63e-03, grad_scale: 16.0 2023-06-28 02:48:06,335 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1956096.0, ans=0.1 2023-06-28 02:48:09,007 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.407e+02 6.035e+02 7.930e+02 1.297e+03 2.545e+03, threshold=1.586e+03, percent-clipped=7.0 2023-06-28 02:48:15,205 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.46 vs. limit=6.0 2023-06-28 02:49:04,831 INFO [train.py:996] (3/4) Epoch 11, batch 21100, loss[loss=0.1872, simple_loss=0.2563, pruned_loss=0.05907, over 21477.00 frames. ], tot_loss[loss=0.2009, simple_loss=0.276, pruned_loss=0.06286, over 4258716.85 frames. ], batch size: 132, lr: 2.63e-03, grad_scale: 16.0 2023-06-28 02:50:07,625 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1956456.0, ans=0.125 2023-06-28 02:50:15,369 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1956456.0, ans=0.1 2023-06-28 02:50:28,155 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=1956516.0, ans=0.0 2023-06-28 02:50:29,741 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1956516.0, ans=0.125 2023-06-28 02:50:40,600 INFO [train.py:996] (3/4) Epoch 11, batch 21150, loss[loss=0.1811, simple_loss=0.2469, pruned_loss=0.05766, over 21397.00 frames. ], tot_loss[loss=0.1986, simple_loss=0.2719, pruned_loss=0.06267, over 4254357.07 frames. 
], batch size: 131, lr: 2.63e-03, grad_scale: 16.0 2023-06-28 02:50:41,227 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1956576.0, ans=0.0 2023-06-28 02:50:59,225 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=1956576.0, ans=0.2 2023-06-28 02:51:09,252 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=1956636.0, ans=0.125 2023-06-28 02:51:12,616 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.64 vs. limit=22.5 2023-06-28 02:51:17,467 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1956696.0, ans=0.1 2023-06-28 02:51:26,153 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.779e+02 6.313e+02 9.139e+02 1.246e+03 3.367e+03, threshold=1.828e+03, percent-clipped=14.0 2023-06-28 02:51:36,237 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=1956696.0, ans=0.2 2023-06-28 02:51:39,611 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1956756.0, ans=0.125 2023-06-28 02:52:03,833 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1956816.0, ans=0.125 2023-06-28 02:52:16,462 INFO [train.py:996] (3/4) Epoch 11, batch 21200, loss[loss=0.1666, simple_loss=0.2311, pruned_loss=0.05106, over 15535.00 frames. ], tot_loss[loss=0.1968, simple_loss=0.2695, pruned_loss=0.06204, over 4242020.72 frames. ], batch size: 60, lr: 2.63e-03, grad_scale: 32.0 2023-06-28 02:52:28,838 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1956876.0, ans=0.125 2023-06-28 02:52:33,375 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1956876.0, ans=0.0 2023-06-28 02:52:48,557 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=1956936.0, ans=0.125 2023-06-28 02:52:53,587 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1956996.0, ans=0.125 2023-06-28 02:53:57,258 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=1957176.0, ans=0.09899494936611666 2023-06-28 02:53:58,190 INFO [train.py:996] (3/4) Epoch 11, batch 21250, loss[loss=0.197, simple_loss=0.2772, pruned_loss=0.05841, over 21653.00 frames. ], tot_loss[loss=0.1957, simple_loss=0.2676, pruned_loss=0.06194, over 4248174.09 frames. ], batch size: 298, lr: 2.63e-03, grad_scale: 8.0 2023-06-28 02:54:08,341 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=1957176.0, ans=0.125 2023-06-28 02:54:10,251 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1957176.0, ans=0.1 2023-06-28 02:54:10,768 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.89 vs. 
limit=22.5 2023-06-28 02:54:47,826 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.642e+02 7.284e+02 1.070e+03 1.587e+03 2.954e+03, threshold=2.141e+03, percent-clipped=16.0 2023-06-28 02:54:51,874 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1957296.0, ans=0.0 2023-06-28 02:55:15,010 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1957416.0, ans=0.1 2023-06-28 02:55:21,043 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1957416.0, ans=0.1 2023-06-28 02:55:39,436 INFO [train.py:996] (3/4) Epoch 11, batch 21300, loss[loss=0.2173, simple_loss=0.295, pruned_loss=0.06981, over 21924.00 frames. ], tot_loss[loss=0.2007, simple_loss=0.273, pruned_loss=0.06424, over 4251216.96 frames. ], batch size: 333, lr: 2.63e-03, grad_scale: 8.0 2023-06-28 02:55:56,325 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=1957476.0, ans=0.0 2023-06-28 02:56:18,421 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1957596.0, ans=0.0 2023-06-28 02:57:22,789 INFO [train.py:996] (3/4) Epoch 11, batch 21350, loss[loss=0.1773, simple_loss=0.2755, pruned_loss=0.03955, over 21773.00 frames. ], tot_loss[loss=0.2034, simple_loss=0.2777, pruned_loss=0.06457, over 4265662.76 frames. ], batch size: 282, lr: 2.63e-03, grad_scale: 8.0 2023-06-28 02:57:42,833 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.99 vs. limit=10.0 2023-06-28 02:58:07,014 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1957896.0, ans=0.125 2023-06-28 02:58:08,236 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.369e+02 7.362e+02 1.168e+03 1.519e+03 3.106e+03, threshold=2.337e+03, percent-clipped=14.0 2023-06-28 02:58:09,357 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.23 vs. limit=15.0 2023-06-28 02:58:25,656 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=1957956.0, ans=0.125 2023-06-28 02:59:07,603 INFO [train.py:996] (3/4) Epoch 11, batch 21400, loss[loss=0.228, simple_loss=0.3075, pruned_loss=0.0743, over 21355.00 frames. ], tot_loss[loss=0.2046, simple_loss=0.2811, pruned_loss=0.064, over 4270131.34 frames. 
], batch size: 176, lr: 2.63e-03, grad_scale: 8.0 2023-06-28 02:59:13,284 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=1958076.0, ans=0.0 2023-06-28 02:59:41,339 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1958136.0, ans=0.125 2023-06-28 02:59:42,750 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=1958136.0, ans=0.2 2023-06-28 03:00:36,410 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=1958316.0, ans=0.2 2023-06-28 03:00:44,638 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=1958316.0, ans=0.05 2023-06-28 03:00:49,132 INFO [train.py:996] (3/4) Epoch 11, batch 21450, loss[loss=0.2112, simple_loss=0.293, pruned_loss=0.06472, over 21799.00 frames. ], tot_loss[loss=0.2087, simple_loss=0.2852, pruned_loss=0.06607, over 4278764.57 frames. ], batch size: 124, lr: 2.63e-03, grad_scale: 8.0 2023-06-28 03:01:33,847 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.846e+02 6.247e+02 7.898e+02 1.203e+03 2.207e+03, threshold=1.580e+03, percent-clipped=0.0 2023-06-28 03:02:06,560 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=1958556.0, ans=0.2 2023-06-28 03:02:30,234 INFO [train.py:996] (3/4) Epoch 11, batch 21500, loss[loss=0.1818, simple_loss=0.2512, pruned_loss=0.05619, over 21565.00 frames. ], tot_loss[loss=0.2088, simple_loss=0.2831, pruned_loss=0.0672, over 4282797.61 frames. ], batch size: 263, lr: 2.63e-03, grad_scale: 8.0 2023-06-28 03:03:06,848 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.60 vs. limit=10.0 2023-06-28 03:03:23,997 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1958796.0, ans=0.0 2023-06-28 03:03:24,523 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.31 vs. limit=15.0 2023-06-28 03:03:58,333 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=1958916.0, ans=0.0 2023-06-28 03:04:06,836 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer_ff2.min_abs, batch_count=1958916.0, ans=0.1 2023-06-28 03:04:11,236 INFO [train.py:996] (3/4) Epoch 11, batch 21550, loss[loss=0.1464, simple_loss=0.2222, pruned_loss=0.03528, over 21616.00 frames. ], tot_loss[loss=0.2025, simple_loss=0.2761, pruned_loss=0.06449, over 4267833.58 frames. ], batch size: 263, lr: 2.63e-03, grad_scale: 8.0 2023-06-28 03:04:17,572 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=8.46 vs. limit=15.0 2023-06-28 03:04:43,669 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=12.65 vs. 
limit=15.0 2023-06-28 03:04:56,024 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.241e+02 6.233e+02 9.531e+02 1.253e+03 2.671e+03, threshold=1.906e+03, percent-clipped=10.0 2023-06-28 03:05:19,964 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1959156.0, ans=0.125 2023-06-28 03:05:49,817 INFO [train.py:996] (3/4) Epoch 11, batch 21600, loss[loss=0.1881, simple_loss=0.2611, pruned_loss=0.05762, over 21201.00 frames. ], tot_loss[loss=0.1988, simple_loss=0.2717, pruned_loss=0.06292, over 4273401.59 frames. ], batch size: 176, lr: 2.63e-03, grad_scale: 16.0 2023-06-28 03:07:11,672 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=1959456.0, ans=0.0 2023-06-28 03:07:14,929 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1959516.0, ans=0.125 2023-06-28 03:07:22,406 INFO [scaling.py:962] (3/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=5.88 vs. limit=8.0 2023-06-28 03:07:37,236 INFO [train.py:996] (3/4) Epoch 11, batch 21650, loss[loss=0.2614, simple_loss=0.356, pruned_loss=0.08336, over 21535.00 frames. ], tot_loss[loss=0.1999, simple_loss=0.2763, pruned_loss=0.06178, over 4266968.60 frames. ], batch size: 471, lr: 2.63e-03, grad_scale: 16.0 2023-06-28 03:07:47,731 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=1959576.0, ans=0.2 2023-06-28 03:08:17,220 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1959696.0, ans=0.125 2023-06-28 03:08:26,013 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.567e+02 7.101e+02 1.132e+03 1.604e+03 3.542e+03, threshold=2.263e+03, percent-clipped=14.0 2023-06-28 03:09:18,420 INFO [train.py:996] (3/4) Epoch 11, batch 21700, loss[loss=0.1795, simple_loss=0.2857, pruned_loss=0.03665, over 19827.00 frames. ], tot_loss[loss=0.1988, simple_loss=0.2774, pruned_loss=0.06013, over 4267110.23 frames. ], batch size: 703, lr: 2.63e-03, grad_scale: 16.0 2023-06-28 03:09:28,981 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=1959876.0, ans=0.2 2023-06-28 03:09:33,989 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.07 vs. limit=22.5 2023-06-28 03:09:40,815 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=1959936.0, ans=0.07 2023-06-28 03:10:25,053 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=10.75 vs. limit=15.0 2023-06-28 03:10:45,804 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=1960116.0, ans=0.2 2023-06-28 03:11:00,007 INFO [train.py:996] (3/4) Epoch 11, batch 21750, loss[loss=0.1781, simple_loss=0.239, pruned_loss=0.05862, over 21504.00 frames. ], tot_loss[loss=0.198, simple_loss=0.2745, pruned_loss=0.06075, over 4257367.80 frames. 
], batch size: 212, lr: 2.63e-03, grad_scale: 16.0 2023-06-28 03:11:08,461 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1960176.0, ans=0.125 2023-06-28 03:11:43,944 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.642e+02 7.621e+02 1.214e+03 1.880e+03 3.851e+03, threshold=2.427e+03, percent-clipped=16.0 2023-06-28 03:11:54,877 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer_ff3.min_abs, batch_count=1960356.0, ans=0.2 2023-06-28 03:12:37,218 INFO [train.py:996] (3/4) Epoch 11, batch 21800, loss[loss=0.2424, simple_loss=0.335, pruned_loss=0.07494, over 21845.00 frames. ], tot_loss[loss=0.1979, simple_loss=0.2723, pruned_loss=0.06175, over 4259257.35 frames. ], batch size: 317, lr: 2.63e-03, grad_scale: 16.0 2023-06-28 03:12:45,041 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1960476.0, ans=0.1 2023-06-28 03:13:49,621 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1960656.0, ans=0.1 2023-06-28 03:14:15,438 INFO [train.py:996] (3/4) Epoch 11, batch 21850, loss[loss=0.2094, simple_loss=0.2832, pruned_loss=0.0678, over 21826.00 frames. ], tot_loss[loss=0.2006, simple_loss=0.2768, pruned_loss=0.0622, over 4269262.09 frames. ], batch size: 124, lr: 2.63e-03, grad_scale: 16.0 2023-06-28 03:14:16,820 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=6.13 vs. limit=10.0 2023-06-28 03:14:17,843 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=1960776.0, ans=0.5 2023-06-28 03:14:24,278 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1960776.0, ans=0.125 2023-06-28 03:14:44,495 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=1960836.0, ans=0.125 2023-06-28 03:14:51,983 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.95 vs. limit=15.0 2023-06-28 03:14:52,898 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1960896.0, ans=0.0 2023-06-28 03:14:54,524 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=1960896.0, ans=0.0 2023-06-28 03:15:00,556 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.488e+02 6.380e+02 8.991e+02 1.412e+03 2.394e+03, threshold=1.798e+03, percent-clipped=0.0 2023-06-28 03:15:17,529 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1960956.0, ans=0.125 2023-06-28 03:15:22,883 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=11.90 vs. 
limit=15.0 2023-06-28 03:15:27,624 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten.whitening_limit, batch_count=1960956.0, ans=15.0 2023-06-28 03:15:35,942 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1961016.0, ans=0.0 2023-06-28 03:15:52,999 INFO [train.py:996] (3/4) Epoch 11, batch 21900, loss[loss=0.2375, simple_loss=0.3007, pruned_loss=0.08714, over 21800.00 frames. ], tot_loss[loss=0.202, simple_loss=0.2781, pruned_loss=0.06291, over 4271767.64 frames. ], batch size: 441, lr: 2.63e-03, grad_scale: 16.0 2023-06-28 03:16:19,272 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys.whitening_limit, batch_count=1961136.0, ans=6.0 2023-06-28 03:16:20,646 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1961136.0, ans=0.0 2023-06-28 03:16:23,660 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=1961136.0, ans=0.5 2023-06-28 03:16:25,334 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1961136.0, ans=0.125 2023-06-28 03:16:50,066 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.22 vs. limit=22.5 2023-06-28 03:16:53,605 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.52 vs. limit=10.0 2023-06-28 03:17:04,408 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=1961256.0, ans=0.2 2023-06-28 03:17:10,711 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=1961316.0, ans=0.2 2023-06-28 03:17:22,945 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=10.68 vs. limit=15.0 2023-06-28 03:17:29,752 INFO [train.py:996] (3/4) Epoch 11, batch 21950, loss[loss=0.1875, simple_loss=0.2344, pruned_loss=0.07025, over 20332.00 frames. ], tot_loss[loss=0.1989, simple_loss=0.2729, pruned_loss=0.06242, over 4269498.87 frames. ], batch size: 703, lr: 2.62e-03, grad_scale: 16.0 2023-06-28 03:17:36,910 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=1961376.0, ans=0.07 2023-06-28 03:17:38,914 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.00 vs. limit=15.0 2023-06-28 03:18:23,015 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.683e+02 5.864e+02 6.968e+02 1.003e+03 1.764e+03, threshold=1.394e+03, percent-clipped=0.0 2023-06-28 03:18:34,436 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1961556.0, ans=0.125 2023-06-28 03:19:11,925 INFO [train.py:996] (3/4) Epoch 11, batch 22000, loss[loss=0.2346, simple_loss=0.3381, pruned_loss=0.06551, over 21228.00 frames. ], tot_loss[loss=0.1933, simple_loss=0.2674, pruned_loss=0.05958, over 4274077.04 frames. 
], batch size: 549, lr: 2.62e-03, grad_scale: 32.0 2023-06-28 03:19:23,358 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=8.13 vs. limit=15.0 2023-06-28 03:19:39,456 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.min_abs, batch_count=1961736.0, ans=0.5 2023-06-28 03:19:53,849 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1961796.0, ans=0.125 2023-06-28 03:20:10,587 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=1961796.0, ans=0.0 2023-06-28 03:20:22,676 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1961856.0, ans=0.125 2023-06-28 03:20:26,158 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1961856.0, ans=0.1 2023-06-28 03:20:31,271 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-28 03:20:55,946 INFO [train.py:996] (3/4) Epoch 11, batch 22050, loss[loss=0.2273, simple_loss=0.3063, pruned_loss=0.07411, over 21446.00 frames. ], tot_loss[loss=0.1976, simple_loss=0.2729, pruned_loss=0.0611, over 4270117.27 frames. ], batch size: 211, lr: 2.62e-03, grad_scale: 16.0 2023-06-28 03:21:45,657 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.whiten.whitening_limit, batch_count=1962096.0, ans=12.0 2023-06-28 03:21:53,076 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.146e+02 7.401e+02 1.317e+03 1.911e+03 4.599e+03, threshold=2.634e+03, percent-clipped=46.0 2023-06-28 03:22:40,201 INFO [train.py:996] (3/4) Epoch 11, batch 22100, loss[loss=0.2367, simple_loss=0.3079, pruned_loss=0.08272, over 21785.00 frames. ], tot_loss[loss=0.2058, simple_loss=0.2824, pruned_loss=0.06458, over 4256636.07 frames. ], batch size: 414, lr: 2.62e-03, grad_scale: 16.0 2023-06-28 03:22:43,211 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=18.96 vs. limit=22.5 2023-06-28 03:23:28,868 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1962396.0, ans=0.125 2023-06-28 03:23:42,414 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=8.06 vs. limit=15.0 2023-06-28 03:24:01,871 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=1962516.0, ans=0.125 2023-06-28 03:24:17,364 INFO [train.py:996] (3/4) Epoch 11, batch 22150, loss[loss=0.2146, simple_loss=0.2839, pruned_loss=0.07262, over 21562.00 frames. ], tot_loss[loss=0.2091, simple_loss=0.286, pruned_loss=0.06614, over 4267529.70 frames. ], batch size: 195, lr: 2.62e-03, grad_scale: 16.0 2023-06-28 03:24:21,507 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=7.73 vs. 
limit=12.0 2023-06-28 03:25:05,710 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1962696.0, ans=0.125 2023-06-28 03:25:13,699 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.995e+02 8.783e+02 1.255e+03 1.849e+03 4.260e+03, threshold=2.511e+03, percent-clipped=3.0 2023-06-28 03:25:14,252 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1962696.0, ans=0.125 2023-06-28 03:25:42,884 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.77 vs. limit=15.0 2023-06-28 03:25:46,152 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.19 vs. limit=15.0 2023-06-28 03:26:00,174 INFO [train.py:996] (3/4) Epoch 11, batch 22200, loss[loss=0.198, simple_loss=0.292, pruned_loss=0.05196, over 21366.00 frames. ], tot_loss[loss=0.2105, simple_loss=0.2872, pruned_loss=0.06693, over 4275707.31 frames. ], batch size: 176, lr: 2.62e-03, grad_scale: 16.0 2023-06-28 03:26:17,353 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-28 03:26:22,691 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.19 vs. limit=6.0 2023-06-28 03:26:30,485 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=1962936.0, ans=0.2 2023-06-28 03:26:31,159 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.20 vs. limit=15.0 2023-06-28 03:26:58,375 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1962996.0, ans=0.125 2023-06-28 03:27:22,664 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1963056.0, ans=0.0 2023-06-28 03:27:42,225 INFO [train.py:996] (3/4) Epoch 11, batch 22250, loss[loss=0.25, simple_loss=0.3352, pruned_loss=0.08236, over 21755.00 frames. ], tot_loss[loss=0.2153, simple_loss=0.2934, pruned_loss=0.06864, over 4282156.63 frames. 
], batch size: 441, lr: 2.62e-03, grad_scale: 16.0 2023-06-28 03:28:08,915 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1963236.0, ans=0.125 2023-06-28 03:28:13,302 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=1963236.0, ans=0.125 2023-06-28 03:28:37,931 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.198e+02 6.711e+02 8.468e+02 1.239e+03 3.194e+03, threshold=1.694e+03, percent-clipped=5.0 2023-06-28 03:28:48,624 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1963356.0, ans=0.125 2023-06-28 03:28:50,194 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1963356.0, ans=0.1 2023-06-28 03:29:27,333 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=1963476.0, ans=0.125 2023-06-28 03:29:28,290 INFO [train.py:996] (3/4) Epoch 11, batch 22300, loss[loss=0.2015, simple_loss=0.2674, pruned_loss=0.06779, over 21467.00 frames. ], tot_loss[loss=0.2184, simple_loss=0.2959, pruned_loss=0.07049, over 4283848.69 frames. ], batch size: 194, lr: 2.62e-03, grad_scale: 16.0 2023-06-28 03:29:48,144 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1963536.0, ans=0.0 2023-06-28 03:29:51,193 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1963536.0, ans=0.0 2023-06-28 03:30:20,221 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=1963596.0, ans=0.125 2023-06-28 03:30:30,829 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1963656.0, ans=0.0 2023-06-28 03:31:14,591 INFO [train.py:996] (3/4) Epoch 11, batch 22350, loss[loss=0.1813, simple_loss=0.2569, pruned_loss=0.05282, over 21260.00 frames. ], tot_loss[loss=0.2184, simple_loss=0.2942, pruned_loss=0.07132, over 4290432.98 frames. ], batch size: 176, lr: 2.62e-03, grad_scale: 16.0 2023-06-28 03:31:30,557 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=1963836.0, ans=0.2 2023-06-28 03:32:01,626 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.484e+02 6.278e+02 9.923e+02 1.351e+03 2.767e+03, threshold=1.985e+03, percent-clipped=14.0 2023-06-28 03:32:58,057 INFO [train.py:996] (3/4) Epoch 11, batch 22400, loss[loss=0.1875, simple_loss=0.2616, pruned_loss=0.05672, over 21551.00 frames. ], tot_loss[loss=0.2153, simple_loss=0.2919, pruned_loss=0.06932, over 4277904.62 frames. ], batch size: 230, lr: 2.62e-03, grad_scale: 32.0 2023-06-28 03:34:37,386 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1964316.0, ans=0.1 2023-06-28 03:34:39,498 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=1964376.0, ans=0.0 2023-06-28 03:34:40,473 INFO [train.py:996] (3/4) Epoch 11, batch 22450, loss[loss=0.1841, simple_loss=0.2567, pruned_loss=0.0557, over 21371.00 frames. ], tot_loss[loss=0.2108, simple_loss=0.2858, pruned_loss=0.06787, over 4280388.87 frames. 
], batch size: 194, lr: 2.62e-03, grad_scale: 8.0 2023-06-28 03:34:45,063 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.16 vs. limit=15.0 2023-06-28 03:34:45,156 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=12.29 vs. limit=15.0 2023-06-28 03:35:09,711 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.36 vs. limit=15.0 2023-06-28 03:35:35,819 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.371e+02 5.963e+02 8.267e+02 1.246e+03 2.225e+03, threshold=1.653e+03, percent-clipped=2.0 2023-06-28 03:35:56,296 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1964556.0, ans=0.125 2023-06-28 03:36:08,616 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.86 vs. limit=10.0 2023-06-28 03:36:21,337 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=1964616.0, ans=0.0 2023-06-28 03:36:22,867 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=1964676.0, ans=0.0 2023-06-28 03:36:24,031 INFO [train.py:996] (3/4) Epoch 11, batch 22500, loss[loss=0.2286, simple_loss=0.3209, pruned_loss=0.06819, over 21486.00 frames. ], tot_loss[loss=0.2081, simple_loss=0.2821, pruned_loss=0.06709, over 4284463.63 frames. ], batch size: 230, lr: 2.62e-03, grad_scale: 8.0 2023-06-28 03:37:40,692 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=1964856.0, ans=0.0 2023-06-28 03:38:06,117 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1964976.0, ans=0.0 2023-06-28 03:38:07,198 INFO [train.py:996] (3/4) Epoch 11, batch 22550, loss[loss=0.1891, simple_loss=0.2665, pruned_loss=0.05581, over 21810.00 frames. ], tot_loss[loss=0.2087, simple_loss=0.2841, pruned_loss=0.06669, over 4290023.92 frames. ], batch size: 247, lr: 2.62e-03, grad_scale: 8.0 2023-06-28 03:38:35,077 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1965036.0, ans=0.125 2023-06-28 03:38:49,095 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.41 vs. 
limit=10.0 2023-06-28 03:39:03,638 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.783e+02 6.887e+02 1.011e+03 1.935e+03 4.167e+03, threshold=2.022e+03, percent-clipped=31.0 2023-06-28 03:39:12,352 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1965156.0, ans=0.1 2023-06-28 03:39:23,749 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1965156.0, ans=0.125 2023-06-28 03:39:33,951 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1965216.0, ans=0.0 2023-06-28 03:39:50,388 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1965216.0, ans=0.125 2023-06-28 03:39:51,822 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=1965216.0, ans=0.125 2023-06-28 03:39:56,245 INFO [train.py:996] (3/4) Epoch 11, batch 22600, loss[loss=0.2089, simple_loss=0.2978, pruned_loss=0.06003, over 21758.00 frames. ], tot_loss[loss=0.2103, simple_loss=0.2872, pruned_loss=0.06667, over 4289172.82 frames. ], batch size: 351, lr: 2.62e-03, grad_scale: 8.0 2023-06-28 03:40:10,044 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1965276.0, ans=0.125 2023-06-28 03:40:28,132 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=1965336.0, ans=0.0 2023-06-28 03:40:35,024 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1965396.0, ans=0.125 2023-06-28 03:40:50,428 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=1965396.0, ans=0.5 2023-06-28 03:41:04,893 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1965456.0, ans=0.0 2023-06-28 03:41:33,171 INFO [train.py:996] (3/4) Epoch 11, batch 22650, loss[loss=0.2015, simple_loss=0.2601, pruned_loss=0.0715, over 21444.00 frames. ], tot_loss[loss=0.2091, simple_loss=0.2841, pruned_loss=0.06702, over 4284680.16 frames. ], batch size: 389, lr: 2.62e-03, grad_scale: 8.0 2023-06-28 03:41:38,384 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=1965576.0, ans=0.0 2023-06-28 03:41:38,914 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.53 vs. 
limit=6.0 2023-06-28 03:41:40,277 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1965576.0, ans=0.0 2023-06-28 03:41:41,774 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=1965576.0, ans=0.0 2023-06-28 03:41:43,496 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1965576.0, ans=0.0 2023-06-28 03:41:51,653 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=1965636.0, ans=0.2 2023-06-28 03:42:09,395 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1965696.0, ans=0.125 2023-06-28 03:42:26,673 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.890e+02 8.413e+02 1.340e+03 1.745e+03 3.098e+03, threshold=2.679e+03, percent-clipped=14.0 2023-06-28 03:43:14,252 INFO [train.py:996] (3/4) Epoch 11, batch 22700, loss[loss=0.1712, simple_loss=0.2438, pruned_loss=0.04933, over 21811.00 frames. ], tot_loss[loss=0.2054, simple_loss=0.2776, pruned_loss=0.06657, over 4281700.33 frames. ], batch size: 118, lr: 2.62e-03, grad_scale: 8.0 2023-06-28 03:43:26,656 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=1965876.0, ans=0.07 2023-06-28 03:43:34,849 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=1965936.0, ans=0.05 2023-06-28 03:44:20,400 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2.whitening_limit, batch_count=1966056.0, ans=15.0 2023-06-28 03:44:56,803 INFO [train.py:996] (3/4) Epoch 11, batch 22750, loss[loss=0.2077, simple_loss=0.2848, pruned_loss=0.06531, over 21936.00 frames. ], tot_loss[loss=0.2076, simple_loss=0.2807, pruned_loss=0.06725, over 4269648.21 frames. ], batch size: 316, lr: 2.62e-03, grad_scale: 8.0 2023-06-28 03:44:58,985 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1966176.0, ans=0.1 2023-06-28 03:45:55,453 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.898e+02 9.181e+02 1.363e+03 2.029e+03 5.534e+03, threshold=2.727e+03, percent-clipped=14.0 2023-06-28 03:46:19,644 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1966416.0, ans=0.1 2023-06-28 03:46:31,673 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.08 vs. limit=15.0 2023-06-28 03:46:35,252 INFO [scaling.py:962] (3/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=5.20 vs. limit=8.0 2023-06-28 03:46:38,649 INFO [train.py:996] (3/4) Epoch 11, batch 22800, loss[loss=0.1666, simple_loss=0.2213, pruned_loss=0.05593, over 20757.00 frames. ], tot_loss[loss=0.2102, simple_loss=0.2838, pruned_loss=0.06832, over 4268224.34 frames. 
], batch size: 607, lr: 2.62e-03, grad_scale: 16.0 2023-06-28 03:47:23,176 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1966596.0, ans=0.1 2023-06-28 03:47:57,132 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=7.25 vs. limit=15.0 2023-06-28 03:48:20,558 INFO [train.py:996] (3/4) Epoch 11, batch 22850, loss[loss=0.21, simple_loss=0.2784, pruned_loss=0.07086, over 21531.00 frames. ], tot_loss[loss=0.2081, simple_loss=0.2808, pruned_loss=0.06766, over 4276681.79 frames. ], batch size: 441, lr: 2.62e-03, grad_scale: 16.0 2023-06-28 03:48:59,145 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1966836.0, ans=0.125 2023-06-28 03:49:19,989 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.398e+02 6.821e+02 8.997e+02 1.443e+03 3.960e+03, threshold=1.799e+03, percent-clipped=4.0 2023-06-28 03:49:20,524 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1966896.0, ans=0.125 2023-06-28 03:50:00,375 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.65 vs. limit=15.0 2023-06-28 03:50:04,231 INFO [train.py:996] (3/4) Epoch 11, batch 22900, loss[loss=0.2215, simple_loss=0.3346, pruned_loss=0.05423, over 21651.00 frames. ], tot_loss[loss=0.2084, simple_loss=0.2823, pruned_loss=0.06724, over 4270480.89 frames. ], batch size: 389, lr: 2.62e-03, grad_scale: 16.0 2023-06-28 03:50:05,209 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1967076.0, ans=0.125 2023-06-28 03:50:16,018 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.96 vs. limit=15.0 2023-06-28 03:50:40,121 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1967136.0, ans=0.0 2023-06-28 03:51:04,428 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.33 vs. limit=15.0 2023-06-28 03:51:47,313 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1967376.0, ans=0.125 2023-06-28 03:51:48,482 INFO [train.py:996] (3/4) Epoch 11, batch 22950, loss[loss=0.2454, simple_loss=0.3674, pruned_loss=0.06164, over 21743.00 frames. ], tot_loss[loss=0.2141, simple_loss=0.2945, pruned_loss=0.06689, over 4264427.04 frames. 
], batch size: 332, lr: 2.62e-03, grad_scale: 16.0 2023-06-28 03:51:50,544 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=1967376.0, ans=0.2 2023-06-28 03:52:26,002 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1967436.0, ans=0.125 2023-06-28 03:52:35,884 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=1967496.0, ans=0.05 2023-06-28 03:52:37,519 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.max_positive, batch_count=1967496.0, ans=0.95 2023-06-28 03:52:42,072 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.491e+02 7.317e+02 1.405e+03 2.219e+03 4.116e+03, threshold=2.810e+03, percent-clipped=42.0 2023-06-28 03:52:50,847 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=1967556.0, ans=0.0 2023-06-28 03:53:07,408 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=1967616.0, ans=0.2 2023-06-28 03:53:25,447 INFO [train.py:996] (3/4) Epoch 11, batch 23000, loss[loss=0.2003, simple_loss=0.2794, pruned_loss=0.06059, over 21900.00 frames. ], tot_loss[loss=0.2116, simple_loss=0.2927, pruned_loss=0.06522, over 4269786.03 frames. ], batch size: 316, lr: 2.62e-03, grad_scale: 16.0 2023-06-28 03:53:58,075 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=1967736.0, ans=10.0 2023-06-28 03:53:58,687 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.37 vs. limit=22.5 2023-06-28 03:54:06,552 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.82 vs. limit=12.0 2023-06-28 03:54:09,749 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1967736.0, ans=0.0 2023-06-28 03:54:27,893 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1967856.0, ans=0.0 2023-06-28 03:55:11,922 INFO [train.py:996] (3/4) Epoch 11, batch 23050, loss[loss=0.2319, simple_loss=0.3066, pruned_loss=0.07858, over 21821.00 frames. ], tot_loss[loss=0.2148, simple_loss=0.295, pruned_loss=0.06727, over 4270866.17 frames. ], batch size: 247, lr: 2.62e-03, grad_scale: 16.0 2023-06-28 03:55:13,168 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.22 vs. 
limit=12.0 2023-06-28 03:56:02,464 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.806e+02 7.903e+02 1.210e+03 1.646e+03 4.576e+03, threshold=2.420e+03, percent-clipped=5.0 2023-06-28 03:56:07,975 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1968156.0, ans=0.125 2023-06-28 03:56:31,079 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1968216.0, ans=0.125 2023-06-28 03:56:41,902 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1968216.0, ans=0.125 2023-06-28 03:56:50,501 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1968216.0, ans=0.125 2023-06-28 03:56:54,614 INFO [train.py:996] (3/4) Epoch 11, batch 23100, loss[loss=0.1875, simple_loss=0.2566, pruned_loss=0.05918, over 21945.00 frames. ], tot_loss[loss=0.2128, simple_loss=0.2915, pruned_loss=0.06702, over 4274514.99 frames. ], batch size: 113, lr: 2.62e-03, grad_scale: 16.0 2023-06-28 03:57:10,194 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1968276.0, ans=0.0 2023-06-28 03:58:36,205 INFO [train.py:996] (3/4) Epoch 11, batch 23150, loss[loss=0.2205, simple_loss=0.2945, pruned_loss=0.0733, over 21842.00 frames. ], tot_loss[loss=0.209, simple_loss=0.2857, pruned_loss=0.06611, over 4279702.59 frames. ], batch size: 118, lr: 2.62e-03, grad_scale: 16.0 2023-06-28 03:59:20,956 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.155e+02 6.572e+02 9.609e+02 1.447e+03 3.666e+03, threshold=1.922e+03, percent-clipped=4.0 2023-06-28 03:59:24,691 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1968756.0, ans=0.1 2023-06-28 03:59:26,543 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-28 03:59:42,516 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.47 vs. limit=15.0 2023-06-28 03:59:48,439 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=1968816.0, ans=0.2 2023-06-28 04:00:06,814 INFO [train.py:996] (3/4) Epoch 11, batch 23200, loss[loss=0.1892, simple_loss=0.2595, pruned_loss=0.05941, over 21895.00 frames. ], tot_loss[loss=0.2087, simple_loss=0.2842, pruned_loss=0.06666, over 4274231.65 frames. ], batch size: 283, lr: 2.62e-03, grad_scale: 32.0 2023-06-28 04:00:14,019 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1968876.0, ans=0.0 2023-06-28 04:01:39,510 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=1969116.0, ans=0.0 2023-06-28 04:01:48,925 INFO [train.py:996] (3/4) Epoch 11, batch 23250, loss[loss=0.228, simple_loss=0.3002, pruned_loss=0.07795, over 21842.00 frames. ], tot_loss[loss=0.2096, simple_loss=0.284, pruned_loss=0.06757, over 4283467.52 frames. 
], batch size: 351, lr: 2.62e-03, grad_scale: 16.0 2023-06-28 04:02:27,289 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1969296.0, ans=0.125 2023-06-28 04:02:29,645 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1969296.0, ans=0.1 2023-06-28 04:02:32,935 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=1969296.0, ans=0.0 2023-06-28 04:02:42,562 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.956e+02 7.376e+02 1.130e+03 1.714e+03 3.374e+03, threshold=2.260e+03, percent-clipped=21.0 2023-06-28 04:03:03,898 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.85 vs. limit=22.5 2023-06-28 04:03:34,408 INFO [train.py:996] (3/4) Epoch 11, batch 23300, loss[loss=0.2162, simple_loss=0.3178, pruned_loss=0.05735, over 21454.00 frames. ], tot_loss[loss=0.2152, simple_loss=0.2918, pruned_loss=0.06929, over 4291838.18 frames. ], batch size: 194, lr: 2.62e-03, grad_scale: 16.0 2023-06-28 04:04:28,215 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.80 vs. limit=15.0 2023-06-28 04:04:44,294 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1969656.0, ans=0.0 2023-06-28 04:05:18,304 INFO [train.py:996] (3/4) Epoch 11, batch 23350, loss[loss=0.1951, simple_loss=0.2757, pruned_loss=0.0573, over 21349.00 frames. ], tot_loss[loss=0.2148, simple_loss=0.2942, pruned_loss=0.06766, over 4294396.37 frames. ], batch size: 131, lr: 2.62e-03, grad_scale: 16.0 2023-06-28 04:05:24,826 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.61 vs. limit=15.0 2023-06-28 04:05:25,794 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1969776.0, ans=0.0 2023-06-28 04:05:35,592 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=1969836.0, ans=0.2 2023-06-28 04:06:14,290 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.812e+02 6.998e+02 1.084e+03 1.696e+03 4.677e+03, threshold=2.169e+03, percent-clipped=9.0 2023-06-28 04:06:23,256 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1969956.0, ans=0.0 2023-06-28 04:06:25,529 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.88 vs. limit=10.0 2023-06-28 04:06:44,174 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=1970016.0, ans=0.2 2023-06-28 04:07:00,184 INFO [train.py:996] (3/4) Epoch 11, batch 23400, loss[loss=0.1735, simple_loss=0.2238, pruned_loss=0.06158, over 20017.00 frames. ], tot_loss[loss=0.2091, simple_loss=0.2885, pruned_loss=0.06486, over 4290642.32 frames. 
], batch size: 703, lr: 2.62e-03, grad_scale: 16.0 2023-06-28 04:07:11,141 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1970076.0, ans=0.125 2023-06-28 04:07:48,488 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=1970196.0, ans=10.0 2023-06-28 04:08:18,113 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=1970256.0, ans=0.04949747468305833 2023-06-28 04:08:36,444 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1970316.0, ans=0.125 2023-06-28 04:08:38,038 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1970316.0, ans=0.125 2023-06-28 04:08:42,809 INFO [train.py:996] (3/4) Epoch 11, batch 23450, loss[loss=0.201, simple_loss=0.275, pruned_loss=0.06352, over 21589.00 frames. ], tot_loss[loss=0.2126, simple_loss=0.2915, pruned_loss=0.06684, over 4288163.44 frames. ], batch size: 263, lr: 2.62e-03, grad_scale: 16.0 2023-06-28 04:08:45,116 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1970376.0, ans=0.1 2023-06-28 04:09:39,101 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.169e+02 9.066e+02 1.305e+03 2.110e+03 3.921e+03, threshold=2.611e+03, percent-clipped=24.0 2023-06-28 04:10:06,344 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1970616.0, ans=0.125 2023-06-28 04:10:19,189 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-28 04:10:20,266 INFO [train.py:996] (3/4) Epoch 11, batch 23500, loss[loss=0.1985, simple_loss=0.2675, pruned_loss=0.06479, over 21173.00 frames. ], tot_loss[loss=0.215, simple_loss=0.2927, pruned_loss=0.06871, over 4293478.64 frames. ], batch size: 608, lr: 2.62e-03, grad_scale: 16.0 2023-06-28 04:10:49,074 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1970736.0, ans=0.125 2023-06-28 04:11:43,104 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=9.53 vs. limit=15.0 2023-06-28 04:11:49,211 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1970916.0, ans=0.0 2023-06-28 04:11:56,906 INFO [train.py:996] (3/4) Epoch 11, batch 23550, loss[loss=0.1813, simple_loss=0.2486, pruned_loss=0.05699, over 21739.00 frames. ], tot_loss[loss=0.2125, simple_loss=0.2876, pruned_loss=0.0687, over 4277544.15 frames. ], batch size: 316, lr: 2.62e-03, grad_scale: 16.0 2023-06-28 04:12:56,968 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.515e+02 7.071e+02 9.804e+02 1.415e+03 2.782e+03, threshold=1.961e+03, percent-clipped=2.0 2023-06-28 04:13:23,186 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1971216.0, ans=0.1 2023-06-28 04:13:33,841 INFO [train.py:996] (3/4) Epoch 11, batch 23600, loss[loss=0.2161, simple_loss=0.2913, pruned_loss=0.07045, over 21619.00 frames. 
], tot_loss[loss=0.2124, simple_loss=0.2883, pruned_loss=0.06819, over 4277073.47 frames. ], batch size: 263, lr: 2.62e-03, grad_scale: 32.0 2023-06-28 04:13:54,933 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1971336.0, ans=0.125 2023-06-28 04:14:47,353 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=1971456.0, ans=0.125 2023-06-28 04:15:18,255 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=10.43 vs. limit=15.0 2023-06-28 04:15:22,095 INFO [train.py:996] (3/4) Epoch 11, batch 23650, loss[loss=0.1982, simple_loss=0.2783, pruned_loss=0.05902, over 21476.00 frames. ], tot_loss[loss=0.2104, simple_loss=0.2878, pruned_loss=0.06647, over 4284566.95 frames. ], batch size: 194, lr: 2.62e-03, grad_scale: 16.0 2023-06-28 04:16:25,959 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.038e+02 7.663e+02 1.286e+03 2.404e+03 4.690e+03, threshold=2.571e+03, percent-clipped=33.0 2023-06-28 04:17:05,545 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.41 vs. limit=6.0 2023-06-28 04:17:11,304 INFO [train.py:996] (3/4) Epoch 11, batch 23700, loss[loss=0.1823, simple_loss=0.2654, pruned_loss=0.04962, over 21691.00 frames. ], tot_loss[loss=0.212, simple_loss=0.2895, pruned_loss=0.06721, over 4282048.04 frames. ], batch size: 247, lr: 2.62e-03, grad_scale: 16.0 2023-06-28 04:18:03,069 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=7.04 vs. limit=15.0 2023-06-28 04:18:39,968 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.82 vs. limit=22.5 2023-06-28 04:18:45,771 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1972116.0, ans=0.0 2023-06-28 04:18:55,613 INFO [train.py:996] (3/4) Epoch 11, batch 23750, loss[loss=0.1878, simple_loss=0.2794, pruned_loss=0.04816, over 21269.00 frames. ], tot_loss[loss=0.214, simple_loss=0.2928, pruned_loss=0.0676, over 4279599.58 frames. ], batch size: 176, lr: 2.62e-03, grad_scale: 16.0 2023-06-28 04:19:10,402 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1972176.0, ans=0.1 2023-06-28 04:19:22,133 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1972236.0, ans=0.1 2023-06-28 04:19:24,143 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=1972236.0, ans=0.0 2023-06-28 04:19:56,812 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1972296.0, ans=0.125 2023-06-28 04:19:59,571 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.432e+02 7.586e+02 1.231e+03 1.988e+03 4.114e+03, threshold=2.463e+03, percent-clipped=17.0 2023-06-28 04:20:23,906 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.69 vs. 
limit=10.0 2023-06-28 04:20:49,625 INFO [train.py:996] (3/4) Epoch 11, batch 23800, loss[loss=0.3033, simple_loss=0.3794, pruned_loss=0.1136, over 21430.00 frames. ], tot_loss[loss=0.2127, simple_loss=0.2926, pruned_loss=0.06638, over 4280506.49 frames. ], batch size: 507, lr: 2.62e-03, grad_scale: 16.0 2023-06-28 04:20:50,347 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=1972476.0, ans=0.0 2023-06-28 04:20:58,750 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.min_positive, batch_count=1972476.0, ans=0.05 2023-06-28 04:21:12,214 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1972536.0, ans=0.1 2023-06-28 04:21:27,594 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1972596.0, ans=0.125 2023-06-28 04:21:33,876 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1972596.0, ans=0.125 2023-06-28 04:21:38,197 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.39 vs. limit=22.5 2023-06-28 04:22:30,796 INFO [train.py:996] (3/4) Epoch 11, batch 23850, loss[loss=0.2529, simple_loss=0.3327, pruned_loss=0.08658, over 21500.00 frames. ], tot_loss[loss=0.2173, simple_loss=0.2991, pruned_loss=0.06782, over 4284204.84 frames. ], batch size: 131, lr: 2.62e-03, grad_scale: 16.0 2023-06-28 04:22:47,466 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1972836.0, ans=0.0 2023-06-28 04:23:07,986 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1972836.0, ans=0.125 2023-06-28 04:23:30,393 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.838e+02 1.015e+03 1.727e+03 2.965e+03 4.931e+03, threshold=3.454e+03, percent-clipped=27.0 2023-06-28 04:23:42,360 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=1972956.0, ans=0.125 2023-06-28 04:23:45,656 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1972956.0, ans=0.0 2023-06-28 04:23:52,002 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=1972956.0, ans=0.0 2023-06-28 04:24:14,930 INFO [train.py:996] (3/4) Epoch 11, batch 23900, loss[loss=0.1921, simple_loss=0.2658, pruned_loss=0.05924, over 21448.00 frames. ], tot_loss[loss=0.2223, simple_loss=0.3048, pruned_loss=0.06985, over 4279914.77 frames. 
], batch size: 212, lr: 2.62e-03, grad_scale: 16.0 2023-06-28 04:24:24,112 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1973076.0, ans=0.125 2023-06-28 04:24:48,592 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1973136.0, ans=0.125 2023-06-28 04:25:00,013 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1973196.0, ans=0.125 2023-06-28 04:25:57,203 INFO [train.py:996] (3/4) Epoch 11, batch 23950, loss[loss=0.2232, simple_loss=0.2963, pruned_loss=0.07502, over 21178.00 frames. ], tot_loss[loss=0.2196, simple_loss=0.3002, pruned_loss=0.06954, over 4269254.48 frames. ], batch size: 159, lr: 2.62e-03, grad_scale: 16.0 2023-06-28 04:26:36,718 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1973436.0, ans=0.125 2023-06-28 04:26:43,712 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=1973496.0, ans=0.2 2023-06-28 04:27:00,772 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.81 vs. limit=22.5 2023-06-28 04:27:01,140 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.639e+02 7.884e+02 1.240e+03 1.758e+03 3.648e+03, threshold=2.481e+03, percent-clipped=1.0 2023-06-28 04:27:01,802 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=1973556.0, ans=0.125 2023-06-28 04:27:20,066 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.60 vs. limit=6.0 2023-06-28 04:27:22,947 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=1973616.0, ans=0.125 2023-06-28 04:27:29,668 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1973616.0, ans=0.125 2023-06-28 04:27:40,578 INFO [train.py:996] (3/4) Epoch 11, batch 24000, loss[loss=0.2692, simple_loss=0.3539, pruned_loss=0.09224, over 21832.00 frames. ], tot_loss[loss=0.2223, simple_loss=0.3011, pruned_loss=0.07172, over 4272219.59 frames. ], batch size: 118, lr: 2.62e-03, grad_scale: 32.0 2023-06-28 04:27:40,578 INFO [train.py:1019] (3/4) Computing validation loss 2023-06-28 04:28:01,238 INFO [train.py:1028] (3/4) Epoch 11, validation: loss=0.2606, simple_loss=0.3539, pruned_loss=0.08365, over 1796401.00 frames. 2023-06-28 04:28:01,239 INFO [train.py:1029] (3/4) Maximum memory allocated so far is 23690MB 2023-06-28 04:28:21,085 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.18 vs. 
limit=10.0 2023-06-28 04:28:49,092 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=1973796.0, ans=0.125 2023-06-28 04:29:00,963 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=1973796.0, ans=0.2 2023-06-28 04:29:44,949 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1973976.0, ans=0.0 2023-06-28 04:29:45,819 INFO [train.py:996] (3/4) Epoch 11, batch 24050, loss[loss=0.2474, simple_loss=0.3354, pruned_loss=0.07973, over 21488.00 frames. ], tot_loss[loss=0.2231, simple_loss=0.3028, pruned_loss=0.07173, over 4282081.71 frames. ], batch size: 471, lr: 2.62e-03, grad_scale: 16.0 2023-06-28 04:30:00,231 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.74 vs. limit=15.0 2023-06-28 04:30:08,672 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1974036.0, ans=0.0 2023-06-28 04:30:50,230 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.159e+02 7.180e+02 1.052e+03 1.636e+03 2.739e+03, threshold=2.104e+03, percent-clipped=1.0 2023-06-28 04:31:06,290 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1974216.0, ans=0.125 2023-06-28 04:31:33,795 INFO [train.py:996] (3/4) Epoch 11, batch 24100, loss[loss=0.2602, simple_loss=0.3371, pruned_loss=0.09169, over 21601.00 frames. ], tot_loss[loss=0.2206, simple_loss=0.3013, pruned_loss=0.06997, over 4279380.51 frames. ], batch size: 389, lr: 2.62e-03, grad_scale: 16.0 2023-06-28 04:31:49,283 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=1974336.0, ans=0.2 2023-06-28 04:31:58,671 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=1974336.0, ans=0.0 2023-06-28 04:32:44,527 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.min_positive, batch_count=1974456.0, ans=0.05 2023-06-28 04:33:06,434 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.50 vs. limit=22.5 2023-06-28 04:33:14,887 INFO [train.py:996] (3/4) Epoch 11, batch 24150, loss[loss=0.2, simple_loss=0.2723, pruned_loss=0.0639, over 21850.00 frames. ], tot_loss[loss=0.2222, simple_loss=0.302, pruned_loss=0.07123, over 4281709.25 frames. ], batch size: 298, lr: 2.62e-03, grad_scale: 16.0 2023-06-28 04:33:15,421 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=1974576.0, ans=0.0 2023-06-28 04:34:14,498 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.271e+02 8.001e+02 1.203e+03 1.842e+03 3.600e+03, threshold=2.405e+03, percent-clipped=13.0 2023-06-28 04:34:34,174 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1974756.0, ans=0.125 2023-06-28 04:34:58,328 INFO [train.py:996] (3/4) Epoch 11, batch 24200, loss[loss=0.2147, simple_loss=0.297, pruned_loss=0.06618, over 21506.00 frames. ], tot_loss[loss=0.2241, simple_loss=0.3038, pruned_loss=0.07222, over 4279376.00 frames. 
], batch size: 195, lr: 2.62e-03, grad_scale: 16.0 2023-06-28 04:35:14,698 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=2.64 vs. limit=6.0 2023-06-28 04:36:22,749 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=1975056.0, ans=0.0 2023-06-28 04:36:28,202 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=1975116.0, ans=0.0 2023-06-28 04:36:47,137 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.44 vs. limit=22.5 2023-06-28 04:36:47,566 INFO [train.py:996] (3/4) Epoch 11, batch 24250, loss[loss=0.2269, simple_loss=0.3016, pruned_loss=0.07612, over 21478.00 frames. ], tot_loss[loss=0.2186, simple_loss=0.3012, pruned_loss=0.06796, over 4272366.71 frames. ], batch size: 548, lr: 2.62e-03, grad_scale: 16.0 2023-06-28 04:37:48,141 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.666e+02 6.261e+02 9.348e+02 1.527e+03 2.867e+03, threshold=1.870e+03, percent-clipped=6.0 2023-06-28 04:37:48,799 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=1975356.0, ans=0.125 2023-06-28 04:38:35,069 INFO [train.py:996] (3/4) Epoch 11, batch 24300, loss[loss=0.1986, simple_loss=0.2758, pruned_loss=0.06073, over 21743.00 frames. ], tot_loss[loss=0.2105, simple_loss=0.2949, pruned_loss=0.06301, over 4275436.05 frames. ], batch size: 414, lr: 2.62e-03, grad_scale: 16.0 2023-06-28 04:39:40,641 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.05 vs. limit=22.5 2023-06-28 04:39:49,192 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.46 vs. limit=6.0 2023-06-28 04:40:16,733 INFO [train.py:996] (3/4) Epoch 11, batch 24350, loss[loss=0.2293, simple_loss=0.3046, pruned_loss=0.07703, over 21806.00 frames. ], tot_loss[loss=0.2069, simple_loss=0.2905, pruned_loss=0.06163, over 4278764.00 frames. ], batch size: 247, lr: 2.62e-03, grad_scale: 16.0 2023-06-28 04:40:45,993 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1975836.0, ans=0.125 2023-06-28 04:41:02,084 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1975896.0, ans=0.0 2023-06-28 04:41:15,414 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=1975956.0, ans=0.125 2023-06-28 04:41:16,494 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.173e+02 7.216e+02 1.198e+03 1.667e+03 3.137e+03, threshold=2.397e+03, percent-clipped=16.0 2023-06-28 04:41:20,870 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.90 vs. limit=12.0 2023-06-28 04:41:59,534 INFO [train.py:996] (3/4) Epoch 11, batch 24400, loss[loss=0.205, simple_loss=0.2919, pruned_loss=0.05898, over 21685.00 frames. ], tot_loss[loss=0.2112, simple_loss=0.2932, pruned_loss=0.06454, over 4279128.11 frames. 
], batch size: 247, lr: 2.62e-03, grad_scale: 32.0 2023-06-28 04:42:07,402 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1976076.0, ans=0.125 2023-06-28 04:42:12,417 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1976076.0, ans=0.125 2023-06-28 04:42:17,802 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1976136.0, ans=0.125 2023-06-28 04:42:28,936 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1976136.0, ans=0.125 2023-06-28 04:43:26,264 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1976316.0, ans=0.125 2023-06-28 04:43:42,609 INFO [train.py:996] (3/4) Epoch 11, batch 24450, loss[loss=0.2334, simple_loss=0.3239, pruned_loss=0.07143, over 21614.00 frames. ], tot_loss[loss=0.2148, simple_loss=0.2964, pruned_loss=0.06657, over 4281673.65 frames. ], batch size: 263, lr: 2.61e-03, grad_scale: 16.0 2023-06-28 04:44:35,993 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1976496.0, ans=0.125 2023-06-28 04:44:48,221 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.546e+02 6.657e+02 8.727e+02 1.270e+03 2.887e+03, threshold=1.745e+03, percent-clipped=2.0 2023-06-28 04:45:06,855 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1976616.0, ans=0.0 2023-06-28 04:45:14,715 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=1976616.0, ans=0.125 2023-06-28 04:45:24,282 INFO [train.py:996] (3/4) Epoch 11, batch 24500, loss[loss=0.1989, simple_loss=0.2869, pruned_loss=0.05542, over 21890.00 frames. ], tot_loss[loss=0.2157, simple_loss=0.2973, pruned_loss=0.06706, over 4284347.23 frames. ], batch size: 316, lr: 2.61e-03, grad_scale: 16.0 2023-06-28 04:46:01,204 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=1976736.0, ans=0.125 2023-06-28 04:46:07,436 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=1976796.0, ans=0.0 2023-06-28 04:46:26,235 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.61 vs. limit=22.5 2023-06-28 04:46:42,593 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.58 vs. limit=15.0 2023-06-28 04:46:49,144 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1976916.0, ans=0.0 2023-06-28 04:46:51,119 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1976916.0, ans=0.125 2023-06-28 04:47:01,509 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.74 vs. limit=12.0 2023-06-28 04:47:07,104 INFO [train.py:996] (3/4) Epoch 11, batch 24550, loss[loss=0.219, simple_loss=0.2977, pruned_loss=0.07011, over 20764.00 frames. 
], tot_loss[loss=0.2184, simple_loss=0.3004, pruned_loss=0.06816, over 4280330.52 frames. ], batch size: 607, lr: 2.61e-03, grad_scale: 16.0 2023-06-28 04:47:15,600 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=1976976.0, ans=0.125 2023-06-28 04:47:22,461 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1976976.0, ans=0.125 2023-06-28 04:47:25,729 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=1976976.0, ans=0.2 2023-06-28 04:48:18,393 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.753e+02 7.977e+02 1.391e+03 1.923e+03 3.873e+03, threshold=2.782e+03, percent-clipped=31.0 2023-06-28 04:48:35,344 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1977216.0, ans=0.125 2023-06-28 04:48:51,861 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1977216.0, ans=0.125 2023-06-28 04:48:54,437 INFO [train.py:996] (3/4) Epoch 11, batch 24600, loss[loss=0.1923, simple_loss=0.2678, pruned_loss=0.05842, over 21836.00 frames. ], tot_loss[loss=0.2173, simple_loss=0.2974, pruned_loss=0.06856, over 4279451.99 frames. ], batch size: 317, lr: 2.61e-03, grad_scale: 16.0 2023-06-28 04:49:34,937 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=5.53 vs. limit=12.0 2023-06-28 04:50:13,989 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1977516.0, ans=0.125 2023-06-28 04:50:37,063 INFO [train.py:996] (3/4) Epoch 11, batch 24650, loss[loss=0.1852, simple_loss=0.2499, pruned_loss=0.06023, over 21585.00 frames. ], tot_loss[loss=0.2138, simple_loss=0.2908, pruned_loss=0.06845, over 4277877.43 frames. ], batch size: 298, lr: 2.61e-03, grad_scale: 16.0 2023-06-28 04:51:42,514 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.777e+02 8.360e+02 1.097e+03 1.550e+03 2.969e+03, threshold=2.194e+03, percent-clipped=2.0 2023-06-28 04:52:06,294 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=1977816.0, ans=0.07 2023-06-28 04:52:19,285 INFO [train.py:996] (3/4) Epoch 11, batch 24700, loss[loss=0.2137, simple_loss=0.2809, pruned_loss=0.07328, over 21442.00 frames. ], tot_loss[loss=0.2117, simple_loss=0.2884, pruned_loss=0.06753, over 4277665.59 frames. ], batch size: 441, lr: 2.61e-03, grad_scale: 16.0 2023-06-28 04:52:19,967 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-28 04:52:48,006 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=1977936.0, ans=0.09899494936611666 2023-06-28 04:53:38,121 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=1978056.0, ans=0.2 2023-06-28 04:54:01,974 INFO [train.py:996] (3/4) Epoch 11, batch 24750, loss[loss=0.168, simple_loss=0.2382, pruned_loss=0.04888, over 21321.00 frames. ], tot_loss[loss=0.2052, simple_loss=0.2803, pruned_loss=0.06502, over 4279444.85 frames. 
], batch size: 131, lr: 2.61e-03, grad_scale: 16.0 2023-06-28 04:54:09,484 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1978176.0, ans=0.125 2023-06-28 04:54:10,995 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=1978176.0, ans=0.125 2023-06-28 04:54:26,531 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.whiten.whitening_limit, batch_count=1978236.0, ans=12.0 2023-06-28 04:54:59,407 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=1978296.0, ans=0.125 2023-06-28 04:55:07,374 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.038e+02 5.891e+02 8.003e+02 1.099e+03 2.127e+03, threshold=1.601e+03, percent-clipped=0.0 2023-06-28 04:55:07,867 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=1978356.0, ans=0.0 2023-06-28 04:55:17,933 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1978356.0, ans=0.0 2023-06-28 04:55:25,860 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=1978416.0, ans=0.0 2023-06-28 04:55:29,337 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1978416.0, ans=0.1 2023-06-28 04:55:38,486 INFO [train.py:996] (3/4) Epoch 11, batch 24800, loss[loss=0.2352, simple_loss=0.2797, pruned_loss=0.09537, over 21545.00 frames. ], tot_loss[loss=0.2029, simple_loss=0.2758, pruned_loss=0.06499, over 4288881.43 frames. ], batch size: 508, lr: 2.61e-03, grad_scale: 32.0 2023-06-28 04:55:44,256 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1978476.0, ans=0.125 2023-06-28 04:55:46,207 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1978476.0, ans=0.125 2023-06-28 04:56:13,605 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.67 vs. limit=15.0 2023-06-28 04:56:54,746 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=1978656.0, ans=0.125 2023-06-28 04:57:01,186 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=1978656.0, ans=0.125 2023-06-28 04:57:14,351 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1978716.0, ans=0.0 2023-06-28 04:57:22,248 INFO [train.py:996] (3/4) Epoch 11, batch 24850, loss[loss=0.1853, simple_loss=0.2454, pruned_loss=0.06254, over 20104.00 frames. ], tot_loss[loss=0.2048, simple_loss=0.2766, pruned_loss=0.06649, over 4294220.67 frames. 
], batch size: 703, lr: 2.61e-03, grad_scale: 16.0 2023-06-28 04:57:24,598 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=1978776.0, ans=0.5 2023-06-28 04:57:26,298 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1978776.0, ans=0.1 2023-06-28 04:57:46,076 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1978776.0, ans=0.125 2023-06-28 04:58:35,368 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.508e+02 8.527e+02 1.164e+03 1.873e+03 3.084e+03, threshold=2.328e+03, percent-clipped=28.0 2023-06-28 04:58:39,443 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=1978956.0, ans=0.0 2023-06-28 04:59:09,778 INFO [train.py:996] (3/4) Epoch 11, batch 24900, loss[loss=0.1867, simple_loss=0.24, pruned_loss=0.06669, over 20346.00 frames. ], tot_loss[loss=0.2069, simple_loss=0.2792, pruned_loss=0.06735, over 4286807.62 frames. ], batch size: 703, lr: 2.61e-03, grad_scale: 16.0 2023-06-28 05:00:04,851 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1979196.0, ans=0.1 2023-06-28 05:00:25,929 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=11.99 vs. limit=22.5 2023-06-28 05:00:27,256 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1979256.0, ans=0.125 2023-06-28 05:00:33,807 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=1979316.0, ans=0.5 2023-06-28 05:00:45,662 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1979316.0, ans=0.1 2023-06-28 05:00:45,695 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1979316.0, ans=0.125 2023-06-28 05:00:58,713 INFO [train.py:996] (3/4) Epoch 11, batch 24950, loss[loss=0.2607, simple_loss=0.3285, pruned_loss=0.0965, over 21116.00 frames. ], tot_loss[loss=0.2153, simple_loss=0.2881, pruned_loss=0.07129, over 4289219.85 frames. 
], batch size: 143, lr: 2.61e-03, grad_scale: 8.0 2023-06-28 05:01:06,363 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1979376.0, ans=0.0 2023-06-28 05:01:39,484 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1979436.0, ans=0.125 2023-06-28 05:01:49,622 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1979496.0, ans=0.1 2023-06-28 05:02:01,610 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=1979556.0, ans=0.0 2023-06-28 05:02:04,368 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.368e+02 8.687e+02 1.291e+03 2.049e+03 3.753e+03, threshold=2.582e+03, percent-clipped=19.0 2023-06-28 05:02:20,071 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.41 vs. limit=10.0 2023-06-28 05:02:21,118 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=1979616.0, ans=0.125 2023-06-28 05:02:42,786 INFO [train.py:996] (3/4) Epoch 11, batch 25000, loss[loss=0.2014, simple_loss=0.2724, pruned_loss=0.06522, over 21739.00 frames. ], tot_loss[loss=0.2189, simple_loss=0.2933, pruned_loss=0.07231, over 4283370.72 frames. ], batch size: 333, lr: 2.61e-03, grad_scale: 8.0 2023-06-28 05:02:45,401 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1979676.0, ans=0.125 2023-06-28 05:03:28,511 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=1979796.0, ans=0.125 2023-06-28 05:03:41,923 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1979856.0, ans=0.0 2023-06-28 05:04:14,841 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=1979916.0, ans=0.0 2023-06-28 05:04:21,244 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=1979916.0, ans=0.125 2023-06-28 05:04:25,862 INFO [train.py:996] (3/4) Epoch 11, batch 25050, loss[loss=0.1904, simple_loss=0.2567, pruned_loss=0.06207, over 21974.00 frames. ], tot_loss[loss=0.2145, simple_loss=0.2886, pruned_loss=0.07025, over 4276480.11 frames. ], batch size: 103, lr: 2.61e-03, grad_scale: 8.0 2023-06-28 05:04:51,608 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.92 vs. limit=15.0 2023-06-28 05:05:07,153 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.82 vs. limit=12.0 2023-06-28 05:05:13,817 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.29 vs. limit=15.0 2023-06-28 05:05:37,081 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.870e+02 6.206e+02 8.703e+02 1.312e+03 2.418e+03, threshold=1.741e+03, percent-clipped=0.0 2023-06-28 05:05:41,754 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=12.82 vs. 
limit=22.5 2023-06-28 05:05:55,868 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=1980216.0, ans=0.0 2023-06-28 05:06:09,895 INFO [train.py:996] (3/4) Epoch 11, batch 25100, loss[loss=0.2086, simple_loss=0.2992, pruned_loss=0.05896, over 21822.00 frames. ], tot_loss[loss=0.2103, simple_loss=0.283, pruned_loss=0.0688, over 4274431.33 frames. ], batch size: 371, lr: 2.61e-03, grad_scale: 8.0 2023-06-28 05:06:20,076 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=1980276.0, ans=0.125 2023-06-28 05:07:04,179 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1980396.0, ans=0.0 2023-06-28 05:07:51,374 INFO [train.py:996] (3/4) Epoch 11, batch 25150, loss[loss=0.2186, simple_loss=0.3032, pruned_loss=0.06699, over 21698.00 frames. ], tot_loss[loss=0.2093, simple_loss=0.2843, pruned_loss=0.06718, over 4254201.84 frames. ], batch size: 389, lr: 2.61e-03, grad_scale: 8.0 2023-06-28 05:07:55,335 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1980576.0, ans=0.1 2023-06-28 05:07:57,282 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1980576.0, ans=0.1 2023-06-28 05:08:15,064 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=1980636.0, ans=0.125 2023-06-28 05:08:55,625 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.446e+02 6.553e+02 1.065e+03 1.530e+03 2.529e+03, threshold=2.131e+03, percent-clipped=15.0 2023-06-28 05:09:01,976 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.09 vs. limit=22.5 2023-06-28 05:09:06,104 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1980816.0, ans=0.0 2023-06-28 05:09:09,413 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1980816.0, ans=0.125 2023-06-28 05:09:15,998 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1980816.0, ans=0.0 2023-06-28 05:09:28,755 INFO [train.py:996] (3/4) Epoch 11, batch 25200, loss[loss=0.1836, simple_loss=0.2577, pruned_loss=0.05479, over 21874.00 frames. ], tot_loss[loss=0.2063, simple_loss=0.2829, pruned_loss=0.06482, over 4258820.11 frames. ], batch size: 107, lr: 2.61e-03, grad_scale: 16.0 2023-06-28 05:09:37,370 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=1980876.0, ans=0.0 2023-06-28 05:10:21,212 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1980996.0, ans=0.1 2023-06-28 05:10:34,579 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1981056.0, ans=0.0 2023-06-28 05:11:10,870 INFO [train.py:996] (3/4) Epoch 11, batch 25250, loss[loss=0.203, simple_loss=0.2743, pruned_loss=0.06585, over 21739.00 frames. ], tot_loss[loss=0.2036, simple_loss=0.2804, pruned_loss=0.06335, over 4265805.88 frames. 
], batch size: 351, lr: 2.61e-03, grad_scale: 16.0 2023-06-28 05:11:23,324 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.25 vs. limit=15.0 2023-06-28 05:11:39,037 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1981236.0, ans=0.125 2023-06-28 05:12:21,393 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.599e+02 7.557e+02 1.172e+03 1.779e+03 3.738e+03, threshold=2.344e+03, percent-clipped=14.0 2023-06-28 05:12:58,872 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1981476.0, ans=0.0 2023-06-28 05:12:59,822 INFO [train.py:996] (3/4) Epoch 11, batch 25300, loss[loss=0.1743, simple_loss=0.2627, pruned_loss=0.04294, over 21663.00 frames. ], tot_loss[loss=0.2021, simple_loss=0.2784, pruned_loss=0.06287, over 4249381.47 frames. ], batch size: 247, lr: 2.61e-03, grad_scale: 16.0 2023-06-28 05:13:05,832 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=11.54 vs. limit=15.0 2023-06-28 05:13:47,655 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1981596.0, ans=0.0 2023-06-28 05:14:20,877 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1981716.0, ans=0.125 2023-06-28 05:14:44,497 INFO [train.py:996] (3/4) Epoch 11, batch 25350, loss[loss=0.1816, simple_loss=0.2625, pruned_loss=0.0503, over 21714.00 frames. ], tot_loss[loss=0.2017, simple_loss=0.2799, pruned_loss=0.06177, over 4238620.94 frames. ], batch size: 333, lr: 2.61e-03, grad_scale: 16.0 2023-06-28 05:14:54,809 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1981776.0, ans=0.1 2023-06-28 05:15:02,908 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-28 05:15:22,341 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=1981836.0, ans=0.0 2023-06-28 05:15:53,135 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.507e+02 7.550e+02 1.200e+03 1.857e+03 4.350e+03, threshold=2.399e+03, percent-clipped=14.0 2023-06-28 05:16:25,269 INFO [train.py:996] (3/4) Epoch 11, batch 25400, loss[loss=0.1895, simple_loss=0.2542, pruned_loss=0.06245, over 21239.00 frames. ], tot_loss[loss=0.2003, simple_loss=0.2771, pruned_loss=0.06177, over 4250086.85 frames. ], batch size: 143, lr: 2.61e-03, grad_scale: 16.0 2023-06-28 05:17:48,195 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1982316.0, ans=0.125 2023-06-28 05:18:01,262 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=1982316.0, ans=0.125 2023-06-28 05:18:07,477 INFO [train.py:996] (3/4) Epoch 11, batch 25450, loss[loss=0.1919, simple_loss=0.2778, pruned_loss=0.05297, over 21322.00 frames. ], tot_loss[loss=0.2013, simple_loss=0.2772, pruned_loss=0.06267, over 4260659.54 frames. 
], batch size: 159, lr: 2.61e-03, grad_scale: 16.0 2023-06-28 05:19:15,754 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=21.50 vs. limit=22.5 2023-06-28 05:19:17,817 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.365e+02 6.795e+02 1.021e+03 1.795e+03 3.141e+03, threshold=2.041e+03, percent-clipped=7.0 2023-06-28 05:19:36,509 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1982616.0, ans=0.125 2023-06-28 05:19:56,327 INFO [train.py:996] (3/4) Epoch 11, batch 25500, loss[loss=0.1757, simple_loss=0.252, pruned_loss=0.04963, over 21338.00 frames. ], tot_loss[loss=0.2003, simple_loss=0.2788, pruned_loss=0.06087, over 4261870.91 frames. ], batch size: 159, lr: 2.61e-03, grad_scale: 16.0 2023-06-28 05:20:14,008 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.59 vs. limit=15.0 2023-06-28 05:21:12,224 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1982916.0, ans=0.1 2023-06-28 05:21:39,927 INFO [train.py:996] (3/4) Epoch 11, batch 25550, loss[loss=0.2117, simple_loss=0.2964, pruned_loss=0.06348, over 16583.00 frames. ], tot_loss[loss=0.204, simple_loss=0.2855, pruned_loss=0.06122, over 4240837.93 frames. ], batch size: 60, lr: 2.61e-03, grad_scale: 16.0 2023-06-28 05:22:33,079 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=1983096.0, ans=0.2 2023-06-28 05:22:41,650 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=1983156.0, ans=0.2 2023-06-28 05:22:44,325 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.429e+02 7.335e+02 1.015e+03 1.599e+03 3.312e+03, threshold=2.031e+03, percent-clipped=14.0 2023-06-28 05:22:55,016 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1983156.0, ans=0.125 2023-06-28 05:23:07,097 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1983216.0, ans=0.125 2023-06-28 05:23:28,329 INFO [train.py:996] (3/4) Epoch 11, batch 25600, loss[loss=0.3, simple_loss=0.3583, pruned_loss=0.1208, over 21355.00 frames. ], tot_loss[loss=0.2079, simple_loss=0.2902, pruned_loss=0.06277, over 4248061.73 frames. ], batch size: 507, lr: 2.61e-03, grad_scale: 32.0 2023-06-28 05:23:47,456 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=1983336.0, ans=0.0 2023-06-28 05:24:19,693 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=6.16 vs. limit=15.0 2023-06-28 05:24:44,819 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer_na.min_abs, batch_count=1983456.0, ans=0.02 2023-06-28 05:25:10,612 INFO [train.py:996] (3/4) Epoch 11, batch 25650, loss[loss=0.2182, simple_loss=0.2748, pruned_loss=0.0808, over 21236.00 frames. ], tot_loss[loss=0.2112, simple_loss=0.2921, pruned_loss=0.06512, over 4252734.62 frames. 
], batch size: 471, lr: 2.61e-03, grad_scale: 16.0 2023-06-28 05:25:17,992 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=1983576.0, ans=0.2 2023-06-28 05:25:18,610 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=6.84 vs. limit=15.0 2023-06-28 05:25:49,382 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1983696.0, ans=0.125 2023-06-28 05:26:21,286 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.706e+02 6.784e+02 1.002e+03 1.536e+03 3.689e+03, threshold=2.004e+03, percent-clipped=11.0 2023-06-28 05:26:30,472 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=1983816.0, ans=0.0 2023-06-28 05:26:35,409 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1983816.0, ans=0.125 2023-06-28 05:26:52,926 INFO [train.py:996] (3/4) Epoch 11, batch 25700, loss[loss=0.2336, simple_loss=0.293, pruned_loss=0.0871, over 21727.00 frames. ], tot_loss[loss=0.2104, simple_loss=0.2885, pruned_loss=0.06609, over 4256598.67 frames. ], batch size: 441, lr: 2.61e-03, grad_scale: 16.0 2023-06-28 05:28:24,480 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1984116.0, ans=0.1 2023-06-28 05:28:32,205 INFO [train.py:996] (3/4) Epoch 11, batch 25750, loss[loss=0.2282, simple_loss=0.3044, pruned_loss=0.07598, over 21340.00 frames. ], tot_loss[loss=0.2135, simple_loss=0.292, pruned_loss=0.06747, over 4260467.91 frames. ], batch size: 143, lr: 2.61e-03, grad_scale: 16.0 2023-06-28 05:29:00,269 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1984236.0, ans=0.0 2023-06-28 05:29:30,900 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=1984296.0, ans=0.125 2023-06-28 05:29:50,528 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.299e+02 8.292e+02 1.215e+03 2.235e+03 4.745e+03, threshold=2.430e+03, percent-clipped=27.0 2023-06-28 05:29:53,174 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=1984356.0, ans=0.0 2023-06-28 05:29:54,733 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=1984356.0, ans=0.125 2023-06-28 05:30:23,124 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.98 vs. limit=22.5 2023-06-28 05:30:23,219 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=4.59 vs. limit=15.0 2023-06-28 05:30:23,502 INFO [train.py:996] (3/4) Epoch 11, batch 25800, loss[loss=0.2215, simple_loss=0.2991, pruned_loss=0.072, over 21478.00 frames. ], tot_loss[loss=0.2236, simple_loss=0.3038, pruned_loss=0.07173, over 4264439.59 frames. 
], batch size: 211, lr: 2.61e-03, grad_scale: 16.0 2023-06-28 05:30:39,893 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=1984536.0, ans=0.0 2023-06-28 05:31:02,893 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1984536.0, ans=0.125 2023-06-28 05:31:10,528 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=1984596.0, ans=0.0 2023-06-28 05:31:12,323 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=1984596.0, ans=0.2 2023-06-28 05:31:31,998 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=1984656.0, ans=0.2 2023-06-28 05:32:05,405 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1984776.0, ans=0.0 2023-06-28 05:32:06,405 INFO [train.py:996] (3/4) Epoch 11, batch 25850, loss[loss=0.2164, simple_loss=0.3019, pruned_loss=0.06547, over 20071.00 frames. ], tot_loss[loss=0.2239, simple_loss=0.3048, pruned_loss=0.0715, over 4268726.72 frames. ], batch size: 703, lr: 2.61e-03, grad_scale: 16.0 2023-06-28 05:32:11,805 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1984776.0, ans=0.125 2023-06-28 05:32:18,526 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=1984776.0, ans=0.0 2023-06-28 05:33:18,846 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.240e+02 7.750e+02 1.095e+03 1.413e+03 4.702e+03, threshold=2.190e+03, percent-clipped=3.0 2023-06-28 05:33:45,967 INFO [train.py:996] (3/4) Epoch 11, batch 25900, loss[loss=0.1847, simple_loss=0.2547, pruned_loss=0.05731, over 21207.00 frames. ], tot_loss[loss=0.2256, simple_loss=0.3064, pruned_loss=0.07236, over 4278013.24 frames. ], batch size: 607, lr: 2.61e-03, grad_scale: 16.0 2023-06-28 05:34:28,461 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=1985196.0, ans=0.2 2023-06-28 05:34:45,157 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=1985196.0, ans=0.125 2023-06-28 05:35:17,288 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1985316.0, ans=0.125 2023-06-28 05:35:18,678 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=1985316.0, ans=0.035 2023-06-28 05:35:29,665 INFO [train.py:996] (3/4) Epoch 11, batch 25950, loss[loss=0.2432, simple_loss=0.3214, pruned_loss=0.08251, over 21580.00 frames. ], tot_loss[loss=0.2303, simple_loss=0.3116, pruned_loss=0.07445, over 4274987.67 frames. 
], batch size: 414, lr: 2.61e-03, grad_scale: 16.0 2023-06-28 05:35:58,289 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1985436.0, ans=0.125 2023-06-28 05:36:41,749 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.536e+02 7.393e+02 8.893e+02 1.407e+03 4.224e+03, threshold=1.779e+03, percent-clipped=8.0 2023-06-28 05:36:52,783 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=1985616.0, ans=0.2 2023-06-28 05:37:18,832 INFO [train.py:996] (3/4) Epoch 11, batch 26000, loss[loss=0.2266, simple_loss=0.3115, pruned_loss=0.07091, over 21716.00 frames. ], tot_loss[loss=0.2297, simple_loss=0.3118, pruned_loss=0.07376, over 4268292.62 frames. ], batch size: 247, lr: 2.61e-03, grad_scale: 32.0 2023-06-28 05:38:04,108 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=1985796.0, ans=0.2 2023-06-28 05:38:50,365 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1985916.0, ans=0.0 2023-06-28 05:39:00,973 INFO [train.py:996] (3/4) Epoch 11, batch 26050, loss[loss=0.2216, simple_loss=0.2943, pruned_loss=0.07447, over 21872.00 frames. ], tot_loss[loss=0.2306, simple_loss=0.312, pruned_loss=0.07457, over 4272520.34 frames. ], batch size: 371, lr: 2.61e-03, grad_scale: 16.0 2023-06-28 05:39:01,629 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=1985976.0, ans=0.0 2023-06-28 05:39:13,383 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1985976.0, ans=0.125 2023-06-28 05:40:03,335 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.740e+02 6.927e+02 9.303e+02 1.315e+03 2.564e+03, threshold=1.861e+03, percent-clipped=11.0 2023-06-28 05:40:37,572 INFO [train.py:996] (3/4) Epoch 11, batch 26100, loss[loss=0.191, simple_loss=0.2648, pruned_loss=0.05861, over 21942.00 frames. ], tot_loss[loss=0.2269, simple_loss=0.306, pruned_loss=0.07388, over 4279126.57 frames. ], batch size: 316, lr: 2.61e-03, grad_scale: 16.0 2023-06-28 05:40:49,366 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=1986276.0, ans=0.2 2023-06-28 05:42:10,033 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=1986516.0, ans=0.0 2023-06-28 05:42:25,860 INFO [train.py:996] (3/4) Epoch 11, batch 26150, loss[loss=0.2263, simple_loss=0.3035, pruned_loss=0.07456, over 21593.00 frames. ], tot_loss[loss=0.2246, simple_loss=0.3023, pruned_loss=0.07344, over 4285897.73 frames. 
], batch size: 389, lr: 2.61e-03, grad_scale: 16.0 2023-06-28 05:42:54,025 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1986636.0, ans=0.0 2023-06-28 05:43:13,131 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=1986696.0, ans=0.125 2023-06-28 05:43:16,447 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1986696.0, ans=0.0 2023-06-28 05:43:40,782 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.470e+02 6.839e+02 9.051e+02 1.314e+03 2.834e+03, threshold=1.810e+03, percent-clipped=6.0 2023-06-28 05:44:10,840 INFO [train.py:996] (3/4) Epoch 11, batch 26200, loss[loss=0.2129, simple_loss=0.3018, pruned_loss=0.06201, over 21181.00 frames. ], tot_loss[loss=0.2228, simple_loss=0.3024, pruned_loss=0.07157, over 4275772.92 frames. ], batch size: 143, lr: 2.61e-03, grad_scale: 16.0 2023-06-28 05:44:11,521 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=1986876.0, ans=0.125 2023-06-28 05:45:10,206 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1987056.0, ans=0.125 2023-06-28 05:45:17,163 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1987056.0, ans=0.0 2023-06-28 05:45:27,403 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.75 vs. limit=15.0 2023-06-28 05:45:49,437 INFO [train.py:996] (3/4) Epoch 11, batch 26250, loss[loss=0.2352, simple_loss=0.306, pruned_loss=0.08216, over 21337.00 frames. ], tot_loss[loss=0.2241, simple_loss=0.3059, pruned_loss=0.07117, over 4271910.51 frames. ], batch size: 159, lr: 2.61e-03, grad_scale: 16.0 2023-06-28 05:46:02,574 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1987176.0, ans=0.0 2023-06-28 05:46:09,428 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=1987236.0, ans=0.0 2023-06-28 05:46:29,933 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.62 vs. limit=22.5 2023-06-28 05:46:34,547 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-28 05:47:01,912 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.966e+02 7.302e+02 1.108e+03 1.607e+03 4.168e+03, threshold=2.217e+03, percent-clipped=19.0 2023-06-28 05:47:31,740 INFO [train.py:996] (3/4) Epoch 11, batch 26300, loss[loss=0.222, simple_loss=0.2921, pruned_loss=0.07597, over 21882.00 frames. ], tot_loss[loss=0.2226, simple_loss=0.3029, pruned_loss=0.07119, over 4281526.66 frames. ], batch size: 371, lr: 2.61e-03, grad_scale: 16.0 2023-06-28 05:49:16,863 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=1987716.0, ans=0.125 2023-06-28 05:49:19,385 INFO [train.py:996] (3/4) Epoch 11, batch 26350, loss[loss=0.2483, simple_loss=0.3241, pruned_loss=0.08627, over 21563.00 frames. ], tot_loss[loss=0.2225, simple_loss=0.3013, pruned_loss=0.07182, over 4288530.71 frames. 
], batch size: 414, lr: 2.61e-03, grad_scale: 16.0 2023-06-28 05:49:28,394 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1987776.0, ans=0.1 2023-06-28 05:49:46,657 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1987836.0, ans=0.125 2023-06-28 05:50:32,387 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.942e+02 8.987e+02 1.115e+03 1.521e+03 3.466e+03, threshold=2.231e+03, percent-clipped=6.0 2023-06-28 05:50:46,359 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1988016.0, ans=0.1 2023-06-28 05:50:57,816 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1988016.0, ans=0.0 2023-06-28 05:51:02,145 INFO [train.py:996] (3/4) Epoch 11, batch 26400, loss[loss=0.1888, simple_loss=0.2502, pruned_loss=0.06366, over 21235.00 frames. ], tot_loss[loss=0.2201, simple_loss=0.2964, pruned_loss=0.07191, over 4284932.80 frames. ], batch size: 176, lr: 2.61e-03, grad_scale: 32.0 2023-06-28 05:51:02,821 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1988076.0, ans=0.1 2023-06-28 05:52:48,900 INFO [train.py:996] (3/4) Epoch 11, batch 26450, loss[loss=0.2079, simple_loss=0.2804, pruned_loss=0.06769, over 21137.00 frames. ], tot_loss[loss=0.219, simple_loss=0.2954, pruned_loss=0.07132, over 4273911.62 frames. ], batch size: 143, lr: 2.61e-03, grad_scale: 16.0 2023-06-28 05:52:55,887 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=1988376.0, ans=0.125 2023-06-28 05:52:59,309 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1988376.0, ans=0.125 2023-06-28 05:53:24,086 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=1988436.0, ans=0.95 2023-06-28 05:54:09,262 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.671e+02 1.024e+03 1.650e+03 2.442e+03 4.564e+03, threshold=3.300e+03, percent-clipped=28.0 2023-06-28 05:54:11,307 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=1988556.0, ans=0.05 2023-06-28 05:54:35,315 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1988616.0, ans=0.125 2023-06-28 05:54:37,909 INFO [train.py:996] (3/4) Epoch 11, batch 26500, loss[loss=0.1987, simple_loss=0.2824, pruned_loss=0.0575, over 21728.00 frames. ], tot_loss[loss=0.2188, simple_loss=0.2987, pruned_loss=0.0695, over 4279589.64 frames. ], batch size: 298, lr: 2.61e-03, grad_scale: 16.0 2023-06-28 05:54:49,982 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1988676.0, ans=0.125 2023-06-28 05:55:14,959 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1988736.0, ans=0.1 2023-06-28 05:56:28,851 INFO [train.py:996] (3/4) Epoch 11, batch 26550, loss[loss=0.1765, simple_loss=0.2649, pruned_loss=0.04405, over 21629.00 frames. 
], tot_loss[loss=0.2144, simple_loss=0.2951, pruned_loss=0.06686, over 4275652.71 frames. ], batch size: 247, lr: 2.61e-03, grad_scale: 16.0 2023-06-28 05:57:13,290 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1989096.0, ans=0.0 2023-06-28 05:57:29,791 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1989156.0, ans=0.125 2023-06-28 05:57:38,604 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.448e+02 7.811e+02 1.294e+03 2.097e+03 4.356e+03, threshold=2.588e+03, percent-clipped=4.0 2023-06-28 05:57:51,686 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1989156.0, ans=0.0 2023-06-28 05:58:09,641 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=1989276.0, ans=10.0 2023-06-28 05:58:10,614 INFO [train.py:996] (3/4) Epoch 11, batch 26600, loss[loss=0.204, simple_loss=0.2745, pruned_loss=0.06669, over 21318.00 frames. ], tot_loss[loss=0.2132, simple_loss=0.296, pruned_loss=0.06514, over 4280256.79 frames. ], batch size: 131, lr: 2.61e-03, grad_scale: 16.0 2023-06-28 05:58:39,630 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer_na.min_abs, batch_count=1989336.0, ans=0.02 2023-06-28 05:58:42,897 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1989336.0, ans=0.125 2023-06-28 05:59:48,124 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=1989516.0, ans=0.125 2023-06-28 05:59:52,619 INFO [train.py:996] (3/4) Epoch 11, batch 26650, loss[loss=0.1452, simple_loss=0.2285, pruned_loss=0.03099, over 21513.00 frames. ], tot_loss[loss=0.2084, simple_loss=0.2889, pruned_loss=0.06394, over 4268495.26 frames. ], batch size: 195, lr: 2.61e-03, grad_scale: 16.0 2023-06-28 05:59:58,606 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.63 vs. limit=15.0 2023-06-28 06:00:07,590 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1989576.0, ans=0.125 2023-06-28 06:00:17,750 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1989636.0, ans=0.0 2023-06-28 06:01:03,399 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1989756.0, ans=0.125 2023-06-28 06:01:05,984 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.789e+02 5.280e+02 6.792e+02 8.539e+02 2.170e+03, threshold=1.358e+03, percent-clipped=0.0 2023-06-28 06:01:24,183 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=1989816.0, ans=0.0 2023-06-28 06:01:33,569 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=6.79 vs. limit=15.0 2023-06-28 06:01:33,824 INFO [train.py:996] (3/4) Epoch 11, batch 26700, loss[loss=0.227, simple_loss=0.2927, pruned_loss=0.08061, over 21772.00 frames. ], tot_loss[loss=0.2041, simple_loss=0.2833, pruned_loss=0.0624, over 4257249.49 frames. 
], batch size: 441, lr: 2.61e-03, grad_scale: 16.0 2023-06-28 06:01:48,044 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.52 vs. limit=6.0 2023-06-28 06:02:13,567 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=8.74 vs. limit=10.0 2023-06-28 06:03:01,819 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=1990116.0, ans=0.125 2023-06-28 06:03:16,953 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1990176.0, ans=0.125 2023-06-28 06:03:18,045 INFO [train.py:996] (3/4) Epoch 11, batch 26750, loss[loss=0.2595, simple_loss=0.3457, pruned_loss=0.08661, over 21445.00 frames. ], tot_loss[loss=0.2028, simple_loss=0.2829, pruned_loss=0.06134, over 4262971.83 frames. ], batch size: 131, lr: 2.61e-03, grad_scale: 16.0 2023-06-28 06:03:52,365 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1990236.0, ans=0.0 2023-06-28 06:03:52,441 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1990236.0, ans=0.125 2023-06-28 06:04:02,643 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=1990296.0, ans=0.0 2023-06-28 06:04:24,397 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=1990356.0, ans=0.125 2023-06-28 06:04:33,823 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.602e+02 7.194e+02 1.094e+03 1.684e+03 4.507e+03, threshold=2.188e+03, percent-clipped=37.0 2023-06-28 06:04:34,666 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=1990356.0, ans=0.2 2023-06-28 06:05:02,065 INFO [train.py:996] (3/4) Epoch 11, batch 26800, loss[loss=0.2318, simple_loss=0.3114, pruned_loss=0.07607, over 21376.00 frames. ], tot_loss[loss=0.2097, simple_loss=0.2891, pruned_loss=0.06518, over 4264665.00 frames. ], batch size: 549, lr: 2.61e-03, grad_scale: 32.0 2023-06-28 06:06:12,531 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.32 vs. limit=10.0 2023-06-28 06:06:43,234 INFO [train.py:996] (3/4) Epoch 11, batch 26850, loss[loss=0.1996, simple_loss=0.2628, pruned_loss=0.06819, over 21565.00 frames. ], tot_loss[loss=0.2126, simple_loss=0.2905, pruned_loss=0.06736, over 4265794.13 frames. ], batch size: 391, lr: 2.61e-03, grad_scale: 16.0 2023-06-28 06:08:02,684 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.773e+02 7.985e+02 1.116e+03 1.630e+03 3.577e+03, threshold=2.232e+03, percent-clipped=14.0 2023-06-28 06:08:08,612 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1991016.0, ans=0.1 2023-06-28 06:08:20,922 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.43 vs. limit=15.0 2023-06-28 06:08:24,662 INFO [train.py:996] (3/4) Epoch 11, batch 26900, loss[loss=0.1897, simple_loss=0.2562, pruned_loss=0.06158, over 21664.00 frames. 
], tot_loss[loss=0.2078, simple_loss=0.2822, pruned_loss=0.0667, over 4272997.00 frames. ], batch size: 333, lr: 2.61e-03, grad_scale: 16.0 2023-06-28 06:08:55,462 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1991136.0, ans=0.125 2023-06-28 06:09:07,879 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1991136.0, ans=0.0 2023-06-28 06:09:13,136 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1991196.0, ans=0.0 2023-06-28 06:09:19,367 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1991196.0, ans=0.125 2023-06-28 06:09:28,573 INFO [scaling.py:962] (3/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=5.93 vs. limit=8.0 2023-06-28 06:10:05,639 INFO [train.py:996] (3/4) Epoch 11, batch 26950, loss[loss=0.2297, simple_loss=0.3208, pruned_loss=0.06932, over 21765.00 frames. ], tot_loss[loss=0.2093, simple_loss=0.2835, pruned_loss=0.06756, over 4269965.88 frames. ], batch size: 282, lr: 2.61e-03, grad_scale: 16.0 2023-06-28 06:10:58,028 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1991496.0, ans=0.0 2023-06-28 06:11:06,618 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1991496.0, ans=0.125 2023-06-28 06:11:27,132 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.642e+02 6.578e+02 9.395e+02 1.272e+03 2.979e+03, threshold=1.879e+03, percent-clipped=1.0 2023-06-28 06:11:46,939 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1991676.0, ans=0.125 2023-06-28 06:11:47,975 INFO [train.py:996] (3/4) Epoch 11, batch 27000, loss[loss=0.1664, simple_loss=0.2461, pruned_loss=0.04334, over 21234.00 frames. ], tot_loss[loss=0.2073, simple_loss=0.2838, pruned_loss=0.06539, over 4266592.42 frames. ], batch size: 159, lr: 2.60e-03, grad_scale: 8.0 2023-06-28 06:11:47,975 INFO [train.py:1019] (3/4) Computing validation loss 2023-06-28 06:12:09,374 INFO [train.py:1028] (3/4) Epoch 11, validation: loss=0.246, simple_loss=0.3377, pruned_loss=0.07718, over 1796401.00 frames. 2023-06-28 06:12:09,375 INFO [train.py:1029] (3/4) Maximum memory allocated so far is 23690MB 2023-06-28 06:12:34,765 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.69 vs. limit=15.0 2023-06-28 06:12:56,912 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=1991796.0, ans=0.09899494936611666 2023-06-28 06:13:25,791 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1991916.0, ans=0.125 2023-06-28 06:13:52,054 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1991976.0, ans=0.125 2023-06-28 06:13:57,804 INFO [train.py:996] (3/4) Epoch 11, batch 27050, loss[loss=0.2019, simple_loss=0.2868, pruned_loss=0.0585, over 21889.00 frames. ], tot_loss[loss=0.2056, simple_loss=0.2857, pruned_loss=0.06272, over 4270089.44 frames. 
], batch size: 316, lr: 2.60e-03, grad_scale: 8.0 2023-06-28 06:14:00,019 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1991976.0, ans=0.1 2023-06-28 06:14:35,380 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1992036.0, ans=0.125 2023-06-28 06:14:43,584 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1992096.0, ans=0.125 2023-06-28 06:15:07,777 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.395e+02 5.810e+02 8.142e+02 1.096e+03 2.681e+03, threshold=1.628e+03, percent-clipped=6.0 2023-06-28 06:15:08,445 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1992156.0, ans=0.1 2023-06-28 06:15:37,370 INFO [train.py:996] (3/4) Epoch 11, batch 27100, loss[loss=0.2216, simple_loss=0.3002, pruned_loss=0.07151, over 21502.00 frames. ], tot_loss[loss=0.2074, simple_loss=0.2871, pruned_loss=0.06389, over 4282072.87 frames. ], batch size: 131, lr: 2.60e-03, grad_scale: 8.0 2023-06-28 06:16:34,298 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1992396.0, ans=0.1 2023-06-28 06:16:55,363 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.85 vs. limit=15.0 2023-06-28 06:17:03,150 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1992516.0, ans=0.1 2023-06-28 06:17:22,564 INFO [train.py:996] (3/4) Epoch 11, batch 27150, loss[loss=0.343, simple_loss=0.4213, pruned_loss=0.1324, over 21522.00 frames. ], tot_loss[loss=0.2163, simple_loss=0.2983, pruned_loss=0.06717, over 4278190.02 frames. ], batch size: 507, lr: 2.60e-03, grad_scale: 8.0 2023-06-28 06:17:39,768 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1992636.0, ans=0.1 2023-06-28 06:18:35,176 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.727e+02 8.827e+02 1.451e+03 2.136e+03 4.044e+03, threshold=2.902e+03, percent-clipped=43.0 2023-06-28 06:18:38,953 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1992816.0, ans=0.1 2023-06-28 06:18:59,850 INFO [train.py:996] (3/4) Epoch 11, batch 27200, loss[loss=0.2552, simple_loss=0.3355, pruned_loss=0.08742, over 21896.00 frames. ], tot_loss[loss=0.224, simple_loss=0.3081, pruned_loss=0.07, over 4278532.56 frames. ], batch size: 316, lr: 2.60e-03, grad_scale: 16.0 2023-06-28 06:19:24,743 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.16 vs. 
limit=15.0 2023-06-28 06:19:27,699 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1992936.0, ans=0.1 2023-06-28 06:19:41,168 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1992996.0, ans=0.1 2023-06-28 06:19:54,681 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1992996.0, ans=0.125 2023-06-28 06:20:49,571 INFO [train.py:996] (3/4) Epoch 11, batch 27250, loss[loss=0.2306, simple_loss=0.3039, pruned_loss=0.07868, over 21757.00 frames. ], tot_loss[loss=0.2284, simple_loss=0.3103, pruned_loss=0.07329, over 4275461.70 frames. ], batch size: 332, lr: 2.60e-03, grad_scale: 16.0 2023-06-28 06:21:11,611 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1993236.0, ans=0.0 2023-06-28 06:22:14,913 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.213e+02 7.322e+02 9.525e+02 1.331e+03 3.028e+03, threshold=1.905e+03, percent-clipped=1.0 2023-06-28 06:22:35,622 INFO [train.py:996] (3/4) Epoch 11, batch 27300, loss[loss=0.2165, simple_loss=0.3001, pruned_loss=0.06643, over 21611.00 frames. ], tot_loss[loss=0.2295, simple_loss=0.3116, pruned_loss=0.07365, over 4273405.73 frames. ], batch size: 263, lr: 2.60e-03, grad_scale: 16.0 2023-06-28 06:22:37,759 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=1993476.0, ans=0.015 2023-06-28 06:23:51,008 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1993656.0, ans=0.125 2023-06-28 06:23:59,065 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=1993656.0, ans=0.125 2023-06-28 06:24:03,012 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.49 vs. limit=10.0 2023-06-28 06:24:19,997 INFO [train.py:996] (3/4) Epoch 11, batch 27350, loss[loss=0.2424, simple_loss=0.3328, pruned_loss=0.07597, over 21380.00 frames. ], tot_loss[loss=0.2302, simple_loss=0.3125, pruned_loss=0.07396, over 4276828.37 frames. 
], batch size: 131, lr: 2.60e-03, grad_scale: 16.0 2023-06-28 06:24:23,751 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1993776.0, ans=0.125 2023-06-28 06:24:46,335 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1993836.0, ans=0.0 2023-06-28 06:24:48,351 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1993836.0, ans=0.1 2023-06-28 06:25:41,642 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.836e+02 8.634e+02 1.277e+03 1.695e+03 3.535e+03, threshold=2.554e+03, percent-clipped=18.0 2023-06-28 06:25:44,010 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1994016.0, ans=0.0 2023-06-28 06:25:57,380 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=1994016.0, ans=0.2 2023-06-28 06:26:01,643 INFO [train.py:996] (3/4) Epoch 11, batch 27400, loss[loss=0.2135, simple_loss=0.2816, pruned_loss=0.07272, over 21828.00 frames. ], tot_loss[loss=0.2277, simple_loss=0.3082, pruned_loss=0.07362, over 4280120.64 frames. ], batch size: 351, lr: 2.60e-03, grad_scale: 16.0 2023-06-28 06:26:23,356 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1994076.0, ans=0.0 2023-06-28 06:27:09,459 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=1994256.0, ans=0.09899494936611666 2023-06-28 06:27:44,104 INFO [train.py:996] (3/4) Epoch 11, batch 27450, loss[loss=0.2339, simple_loss=0.3242, pruned_loss=0.07181, over 21628.00 frames. ], tot_loss[loss=0.2225, simple_loss=0.3016, pruned_loss=0.07169, over 4277165.20 frames. ], batch size: 414, lr: 2.60e-03, grad_scale: 16.0 2023-06-28 06:28:30,987 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1994496.0, ans=0.0 2023-06-28 06:29:05,157 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.109e+02 6.770e+02 1.005e+03 1.545e+03 3.220e+03, threshold=2.009e+03, percent-clipped=5.0 2023-06-28 06:29:25,915 INFO [train.py:996] (3/4) Epoch 11, batch 27500, loss[loss=0.2247, simple_loss=0.2947, pruned_loss=0.0774, over 21715.00 frames. ], tot_loss[loss=0.2208, simple_loss=0.2989, pruned_loss=0.07139, over 4277213.57 frames. 
], batch size: 389, lr: 2.60e-03, grad_scale: 16.0 2023-06-28 06:29:40,935 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1994676.0, ans=0.125 2023-06-28 06:29:58,472 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=1994736.0, ans=0.125 2023-06-28 06:30:10,022 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=1994736.0, ans=0.2 2023-06-28 06:30:13,375 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=1994796.0, ans=0.125 2023-06-28 06:30:34,737 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=1994856.0, ans=0.125 2023-06-28 06:31:16,369 INFO [train.py:996] (3/4) Epoch 11, batch 27550, loss[loss=0.1938, simple_loss=0.2729, pruned_loss=0.05736, over 21544.00 frames. ], tot_loss[loss=0.2162, simple_loss=0.2946, pruned_loss=0.06893, over 4280848.40 frames. ], batch size: 389, lr: 2.60e-03, grad_scale: 16.0 2023-06-28 06:32:24,381 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=1995156.0, ans=0.2 2023-06-28 06:32:28,474 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.464e+02 6.626e+02 9.640e+02 1.426e+03 2.852e+03, threshold=1.928e+03, percent-clipped=10.0 2023-06-28 06:32:53,149 INFO [train.py:996] (3/4) Epoch 11, batch 27600, loss[loss=0.178, simple_loss=0.2329, pruned_loss=0.06158, over 20732.00 frames. ], tot_loss[loss=0.2116, simple_loss=0.2878, pruned_loss=0.06773, over 4277495.56 frames. ], batch size: 609, lr: 2.60e-03, grad_scale: 32.0 2023-06-28 06:33:10,379 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.25 vs. limit=12.0 2023-06-28 06:33:55,667 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.56 vs. limit=6.0 2023-06-28 06:34:30,772 INFO [train.py:996] (3/4) Epoch 11, batch 27650, loss[loss=0.1933, simple_loss=0.2694, pruned_loss=0.05859, over 21463.00 frames. ], tot_loss[loss=0.208, simple_loss=0.2825, pruned_loss=0.06678, over 4271350.54 frames. ], batch size: 131, lr: 2.60e-03, grad_scale: 32.0 2023-06-28 06:35:44,457 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1995756.0, ans=0.1 2023-06-28 06:35:54,453 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.549e+02 8.580e+02 1.335e+03 1.838e+03 2.881e+03, threshold=2.670e+03, percent-clipped=20.0 2023-06-28 06:36:00,674 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=1995816.0, ans=0.5 2023-06-28 06:36:02,048 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=1995816.0, ans=0.125 2023-06-28 06:36:17,806 INFO [train.py:996] (3/4) Epoch 11, batch 27700, loss[loss=0.2213, simple_loss=0.2907, pruned_loss=0.07596, over 19963.00 frames. ], tot_loss[loss=0.2062, simple_loss=0.2827, pruned_loss=0.06481, over 4270407.46 frames. 
], batch size: 702, lr: 2.60e-03, grad_scale: 16.0 2023-06-28 06:36:18,541 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1995876.0, ans=0.125 2023-06-28 06:36:41,146 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=1995876.0, ans=0.125 2023-06-28 06:37:07,592 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=1995996.0, ans=0.0 2023-06-28 06:37:35,645 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1996056.0, ans=0.125 2023-06-28 06:38:04,198 INFO [train.py:996] (3/4) Epoch 11, batch 27750, loss[loss=0.1846, simple_loss=0.277, pruned_loss=0.04616, over 21845.00 frames. ], tot_loss[loss=0.2072, simple_loss=0.2856, pruned_loss=0.06445, over 4273956.24 frames. ], batch size: 316, lr: 2.60e-03, grad_scale: 16.0 2023-06-28 06:38:56,724 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=11.65 vs. limit=22.5 2023-06-28 06:39:07,621 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1996356.0, ans=0.125 2023-06-28 06:39:18,758 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.799e+02 7.463e+02 1.002e+03 1.388e+03 2.774e+03, threshold=2.003e+03, percent-clipped=1.0 2023-06-28 06:39:22,173 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-28 06:39:39,534 INFO [train.py:996] (3/4) Epoch 11, batch 27800, loss[loss=0.2227, simple_loss=0.2931, pruned_loss=0.07609, over 21409.00 frames. ], tot_loss[loss=0.2069, simple_loss=0.284, pruned_loss=0.06489, over 4276899.91 frames. ], batch size: 177, lr: 2.60e-03, grad_scale: 8.0 2023-06-28 06:40:40,354 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=1996596.0, ans=0.125 2023-06-28 06:41:02,113 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=1996716.0, ans=0.09899494936611666 2023-06-28 06:41:09,330 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.36 vs. limit=15.0 2023-06-28 06:41:26,072 INFO [train.py:996] (3/4) Epoch 11, batch 27850, loss[loss=0.1987, simple_loss=0.2997, pruned_loss=0.04887, over 21626.00 frames. ], tot_loss[loss=0.2069, simple_loss=0.2828, pruned_loss=0.06548, over 4283118.65 frames. ], batch size: 230, lr: 2.60e-03, grad_scale: 8.0 2023-06-28 06:41:49,202 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=1996836.0, ans=0.2 2023-06-28 06:42:41,135 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=10.30 vs. 
limit=22.5 2023-06-28 06:42:49,898 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.063e+02 7.124e+02 9.695e+02 1.441e+03 2.660e+03, threshold=1.939e+03, percent-clipped=8.0 2023-06-28 06:42:55,470 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=1997016.0, ans=0.0 2023-06-28 06:43:02,029 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=1997016.0, ans=0.0 2023-06-28 06:43:02,764 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=11.79 vs. limit=15.0 2023-06-28 06:43:12,505 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=5.02 vs. limit=15.0 2023-06-28 06:43:16,101 INFO [train.py:996] (3/4) Epoch 11, batch 27900, loss[loss=0.2619, simple_loss=0.3521, pruned_loss=0.08585, over 21733.00 frames. ], tot_loss[loss=0.2113, simple_loss=0.2905, pruned_loss=0.06608, over 4287682.20 frames. ], batch size: 414, lr: 2.60e-03, grad_scale: 8.0 2023-06-28 06:43:59,338 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=1997196.0, ans=0.2 2023-06-28 06:44:57,000 INFO [train.py:996] (3/4) Epoch 11, batch 27950, loss[loss=0.2346, simple_loss=0.3256, pruned_loss=0.07186, over 21695.00 frames. ], tot_loss[loss=0.2092, simple_loss=0.2918, pruned_loss=0.06331, over 4282678.72 frames. ], batch size: 441, lr: 2.60e-03, grad_scale: 8.0 2023-06-28 06:45:04,429 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.19 vs. limit=15.0 2023-06-28 06:45:07,685 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1997376.0, ans=0.125 2023-06-28 06:45:09,219 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=1997376.0, ans=0.125 2023-06-28 06:45:22,930 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1997436.0, ans=0.1 2023-06-28 06:46:13,736 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=1997556.0, ans=0.07 2023-06-28 06:46:16,320 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=13.21 vs. limit=15.0 2023-06-28 06:46:18,686 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=1997556.0, ans=0.125 2023-06-28 06:46:18,809 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1997556.0, ans=0.0 2023-06-28 06:46:22,039 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1997616.0, ans=0.125 2023-06-28 06:46:23,004 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.171e+02 6.116e+02 8.584e+02 1.262e+03 3.314e+03, threshold=1.717e+03, percent-clipped=6.0 2023-06-28 06:46:39,423 INFO [train.py:996] (3/4) Epoch 11, batch 28000, loss[loss=0.207, simple_loss=0.2887, pruned_loss=0.06263, over 21884.00 frames. 
], tot_loss[loss=0.2068, simple_loss=0.2906, pruned_loss=0.06155, over 4287895.39 frames. ], batch size: 351, lr: 2.60e-03, grad_scale: 16.0 2023-06-28 06:46:49,445 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=1997676.0, ans=0.125 2023-06-28 06:47:05,196 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=1997736.0, ans=0.0 2023-06-28 06:47:16,933 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=1997796.0, ans=0.2 2023-06-28 06:48:00,199 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=1997916.0, ans=0.125 2023-06-28 06:48:23,191 INFO [train.py:996] (3/4) Epoch 11, batch 28050, loss[loss=0.1661, simple_loss=0.2347, pruned_loss=0.04881, over 21369.00 frames. ], tot_loss[loss=0.2071, simple_loss=0.2884, pruned_loss=0.06293, over 4291830.56 frames. ], batch size: 131, lr: 2.60e-03, grad_scale: 8.0 2023-06-28 06:48:23,984 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1997976.0, ans=0.125 2023-06-28 06:48:35,705 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=1997976.0, ans=0.035 2023-06-28 06:48:47,252 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1998036.0, ans=0.1 2023-06-28 06:49:50,473 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.773e+02 7.059e+02 1.070e+03 1.534e+03 3.837e+03, threshold=2.141e+03, percent-clipped=19.0 2023-06-28 06:50:05,474 INFO [train.py:996] (3/4) Epoch 11, batch 28100, loss[loss=0.2119, simple_loss=0.2726, pruned_loss=0.07559, over 19918.00 frames. ], tot_loss[loss=0.2057, simple_loss=0.2853, pruned_loss=0.06308, over 4285814.63 frames. ], batch size: 703, lr: 2.60e-03, grad_scale: 8.0 2023-06-28 06:50:27,919 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=2.84 vs. limit=12.0 2023-06-28 06:50:56,178 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=14.55 vs. limit=22.5 2023-06-28 06:51:00,076 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=1998396.0, ans=0.125 2023-06-28 06:51:02,265 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.80 vs. limit=15.0 2023-06-28 06:51:09,099 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.96 vs. limit=15.0 2023-06-28 06:51:29,438 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1998516.0, ans=0.1 2023-06-28 06:51:42,289 INFO [train.py:996] (3/4) Epoch 11, batch 28150, loss[loss=0.1931, simple_loss=0.2614, pruned_loss=0.0624, over 21864.00 frames. ], tot_loss[loss=0.2029, simple_loss=0.2793, pruned_loss=0.06322, over 4282836.27 frames. 
], batch size: 373, lr: 2.60e-03, grad_scale: 8.0 2023-06-28 06:52:07,394 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1998636.0, ans=0.0 2023-06-28 06:52:10,586 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=1998636.0, ans=0.125 2023-06-28 06:52:14,042 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=1998696.0, ans=0.125 2023-06-28 06:52:31,668 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1998696.0, ans=0.125 2023-06-28 06:53:04,712 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.152e+02 7.382e+02 1.011e+03 1.548e+03 3.347e+03, threshold=2.022e+03, percent-clipped=11.0 2023-06-28 06:53:19,834 INFO [train.py:996] (3/4) Epoch 11, batch 28200, loss[loss=0.244, simple_loss=0.3151, pruned_loss=0.08639, over 21566.00 frames. ], tot_loss[loss=0.2035, simple_loss=0.2779, pruned_loss=0.06457, over 4282826.11 frames. ], batch size: 414, lr: 2.60e-03, grad_scale: 8.0 2023-06-28 06:53:25,801 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1998876.0, ans=0.0 2023-06-28 06:53:50,110 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=11.85 vs. limit=15.0 2023-06-28 06:54:04,524 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-28 06:54:36,835 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1999056.0, ans=0.125 2023-06-28 06:54:58,082 INFO [train.py:996] (3/4) Epoch 11, batch 28250, loss[loss=0.218, simple_loss=0.281, pruned_loss=0.07743, over 21852.00 frames. ], tot_loss[loss=0.2089, simple_loss=0.2829, pruned_loss=0.06744, over 4279590.82 frames. ], batch size: 107, lr: 2.60e-03, grad_scale: 8.0 2023-06-28 06:55:02,647 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.27 vs. limit=10.0 2023-06-28 06:55:40,923 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.39 vs. limit=15.0 2023-06-28 06:56:12,383 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.81 vs. limit=15.0 2023-06-28 06:56:21,470 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.911e+02 7.590e+02 1.013e+03 1.851e+03 3.926e+03, threshold=2.026e+03, percent-clipped=15.0 2023-06-28 06:56:30,993 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1999416.0, ans=0.0 2023-06-28 06:56:37,218 INFO [train.py:996] (3/4) Epoch 11, batch 28300, loss[loss=0.1532, simple_loss=0.2288, pruned_loss=0.03884, over 21247.00 frames. ], tot_loss[loss=0.2055, simple_loss=0.2805, pruned_loss=0.06529, over 4268658.60 frames. 
], batch size: 159, lr: 2.60e-03, grad_scale: 8.0 2023-06-28 06:56:37,897 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1999476.0, ans=0.1 2023-06-28 06:56:48,111 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=1999476.0, ans=0.125 2023-06-28 06:57:04,082 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1999536.0, ans=0.125 2023-06-28 06:58:15,196 INFO [train.py:996] (3/4) Epoch 11, batch 28350, loss[loss=0.1624, simple_loss=0.256, pruned_loss=0.0344, over 21407.00 frames. ], tot_loss[loss=0.2002, simple_loss=0.2776, pruned_loss=0.0614, over 4264095.79 frames. ], batch size: 211, lr: 2.60e-03, grad_scale: 8.0 2023-06-28 06:58:47,154 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=1999836.0, ans=0.2 2023-06-28 06:59:26,766 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1999956.0, ans=0.1 2023-06-28 06:59:32,110 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=1999956.0, ans=0.2 2023-06-28 06:59:37,854 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.816e+02 8.082e+02 1.144e+03 1.595e+03 4.896e+03, threshold=2.288e+03, percent-clipped=16.0 2023-06-28 06:59:43,584 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=2000016.0, ans=0.0 2023-06-28 06:59:54,692 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=2000016.0, ans=0.125 2023-06-28 06:59:57,601 INFO [train.py:996] (3/4) Epoch 11, batch 28400, loss[loss=0.2203, simple_loss=0.294, pruned_loss=0.07335, over 21683.00 frames. ], tot_loss[loss=0.1987, simple_loss=0.2742, pruned_loss=0.0616, over 4266491.93 frames. ], batch size: 332, lr: 2.60e-03, grad_scale: 16.0 2023-06-28 07:01:15,103 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.88 vs. limit=15.0 2023-06-28 07:01:19,721 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=2000316.0, ans=0.025 2023-06-28 07:01:39,635 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=2000376.0, ans=0.125 2023-06-28 07:01:40,571 INFO [train.py:996] (3/4) Epoch 11, batch 28450, loss[loss=0.2401, simple_loss=0.309, pruned_loss=0.08563, over 21687.00 frames. ], tot_loss[loss=0.2043, simple_loss=0.2794, pruned_loss=0.06465, over 4274378.84 frames. ], batch size: 389, lr: 2.60e-03, grad_scale: 16.0 2023-06-28 07:03:03,711 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.045e+02 8.706e+02 1.350e+03 2.003e+03 3.584e+03, threshold=2.700e+03, percent-clipped=15.0 2023-06-28 07:03:12,186 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=2000616.0, ans=0.05 2023-06-28 07:03:28,164 INFO [train.py:996] (3/4) Epoch 11, batch 28500, loss[loss=0.2169, simple_loss=0.2787, pruned_loss=0.07755, over 21577.00 frames. ], tot_loss[loss=0.2076, simple_loss=0.2821, pruned_loss=0.06657, over 4276177.88 frames. 
], batch size: 548, lr: 2.60e-03, grad_scale: 16.0 2023-06-28 07:04:04,443 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.05 vs. limit=15.0 2023-06-28 07:04:25,999 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-28 07:05:11,628 INFO [train.py:996] (3/4) Epoch 11, batch 28550, loss[loss=0.2596, simple_loss=0.3552, pruned_loss=0.08197, over 21397.00 frames. ], tot_loss[loss=0.2154, simple_loss=0.2917, pruned_loss=0.06951, over 4281534.82 frames. ], batch size: 211, lr: 2.60e-03, grad_scale: 16.0 2023-06-28 07:05:23,771 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=2000976.0, ans=0.035 2023-06-28 07:05:46,749 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=2001036.0, ans=0.0 2023-06-28 07:06:01,377 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.53 vs. limit=6.0 2023-06-28 07:06:15,600 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2001156.0, ans=0.1 2023-06-28 07:06:41,149 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.280e+02 7.153e+02 1.164e+03 1.649e+03 3.101e+03, threshold=2.329e+03, percent-clipped=2.0 2023-06-28 07:06:59,350 INFO [train.py:996] (3/4) Epoch 11, batch 28600, loss[loss=0.2083, simple_loss=0.2835, pruned_loss=0.06654, over 21746.00 frames. ], tot_loss[loss=0.2189, simple_loss=0.297, pruned_loss=0.07044, over 4277986.03 frames. ], batch size: 247, lr: 2.60e-03, grad_scale: 8.0 2023-06-28 07:07:11,936 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=2001276.0, ans=0.0 2023-06-28 07:07:25,208 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=2001336.0, ans=0.2 2023-06-28 07:07:29,060 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=2001336.0, ans=0.125 2023-06-28 07:07:33,768 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=2001396.0, ans=0.125 2023-06-28 07:07:53,432 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2001456.0, ans=0.1 2023-06-28 07:08:30,245 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=2001516.0, ans=0.125 2023-06-28 07:08:41,514 INFO [train.py:996] (3/4) Epoch 11, batch 28650, loss[loss=0.1766, simple_loss=0.2444, pruned_loss=0.05436, over 21751.00 frames. ], tot_loss[loss=0.2152, simple_loss=0.2916, pruned_loss=0.06942, over 4269496.39 frames. 
], batch size: 124, lr: 2.60e-03, grad_scale: 8.0 2023-06-28 07:09:36,145 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=2001696.0, ans=0.0 2023-06-28 07:10:06,740 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.860e+02 6.883e+02 1.055e+03 1.706e+03 3.634e+03, threshold=2.110e+03, percent-clipped=9.0 2023-06-28 07:10:19,941 INFO [train.py:996] (3/4) Epoch 11, batch 28700, loss[loss=0.2391, simple_loss=0.302, pruned_loss=0.08805, over 21356.00 frames. ], tot_loss[loss=0.2161, simple_loss=0.2908, pruned_loss=0.07073, over 4266711.03 frames. ], batch size: 549, lr: 2.60e-03, grad_scale: 8.0 2023-06-28 07:10:40,191 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=2001936.0, ans=0.125 2023-06-28 07:11:42,096 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=2002056.0, ans=0.125 2023-06-28 07:12:03,055 INFO [train.py:996] (3/4) Epoch 11, batch 28750, loss[loss=0.2445, simple_loss=0.3489, pruned_loss=0.07004, over 19845.00 frames. ], tot_loss[loss=0.2184, simple_loss=0.293, pruned_loss=0.07186, over 4269892.12 frames. ], batch size: 703, lr: 2.60e-03, grad_scale: 8.0 2023-06-28 07:13:06,533 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=2002296.0, ans=0.0 2023-06-28 07:13:14,770 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=2002356.0, ans=0.0 2023-06-28 07:13:33,315 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.047e+02 7.903e+02 1.280e+03 1.957e+03 3.313e+03, threshold=2.559e+03, percent-clipped=20.0 2023-06-28 07:13:46,641 INFO [train.py:996] (3/4) Epoch 11, batch 28800, loss[loss=0.2262, simple_loss=0.3056, pruned_loss=0.07335, over 21767.00 frames. ], tot_loss[loss=0.2202, simple_loss=0.2959, pruned_loss=0.07228, over 4275300.55 frames. ], batch size: 332, lr: 2.60e-03, grad_scale: 16.0 2023-06-28 07:13:47,226 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=2002476.0, ans=0.125 2023-06-28 07:14:31,063 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=2002536.0, ans=0.2 2023-06-28 07:14:59,019 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=2002656.0, ans=0.2 2023-06-28 07:15:17,295 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=2002716.0, ans=0.125 2023-06-28 07:15:28,355 INFO [train.py:996] (3/4) Epoch 11, batch 28850, loss[loss=0.2425, simple_loss=0.3113, pruned_loss=0.08688, over 21835.00 frames. ], tot_loss[loss=0.2226, simple_loss=0.2975, pruned_loss=0.07383, over 4280513.73 frames. 
], batch size: 124, lr: 2.60e-03, grad_scale: 16.0 2023-06-28 07:15:35,530 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=2002776.0, ans=0.125 2023-06-28 07:15:58,072 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2002836.0, ans=0.1 2023-06-28 07:16:11,813 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=2002836.0, ans=0.125 2023-06-28 07:16:23,919 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=2002896.0, ans=0.125 2023-06-28 07:16:38,911 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2002956.0, ans=0.1 2023-06-28 07:16:43,242 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.79 vs. limit=15.0 2023-06-28 07:16:58,890 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.940e+02 7.188e+02 1.079e+03 1.563e+03 3.306e+03, threshold=2.159e+03, percent-clipped=4.0 2023-06-28 07:17:08,214 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=2003016.0, ans=0.125 2023-06-28 07:17:12,896 INFO [train.py:996] (3/4) Epoch 11, batch 28900, loss[loss=0.2081, simple_loss=0.2774, pruned_loss=0.06942, over 21331.00 frames. ], tot_loss[loss=0.225, simple_loss=0.2994, pruned_loss=0.07529, over 4283943.80 frames. ], batch size: 194, lr: 2.60e-03, grad_scale: 16.0 2023-06-28 07:18:25,120 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2003256.0, ans=0.1 2023-06-28 07:18:30,432 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer_na.min_abs, batch_count=2003256.0, ans=0.02 2023-06-28 07:19:06,027 INFO [train.py:996] (3/4) Epoch 11, batch 28950, loss[loss=0.223, simple_loss=0.3224, pruned_loss=0.06182, over 21866.00 frames. ], tot_loss[loss=0.2246, simple_loss=0.3005, pruned_loss=0.07436, over 4274580.03 frames. ], batch size: 371, lr: 2.60e-03, grad_scale: 16.0 2023-06-28 07:19:06,527 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=2003376.0, ans=0.0 2023-06-28 07:19:30,743 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=2003436.0, ans=0.125 2023-06-28 07:19:46,156 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=4.71 vs. limit=15.0 2023-06-28 07:20:06,021 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=2003556.0, ans=0.125 2023-06-28 07:20:36,312 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.876e+02 7.501e+02 1.038e+03 1.526e+03 3.753e+03, threshold=2.076e+03, percent-clipped=10.0 2023-06-28 07:20:54,723 INFO [train.py:996] (3/4) Epoch 11, batch 29000, loss[loss=0.2116, simple_loss=0.2893, pruned_loss=0.06692, over 21499.00 frames. ], tot_loss[loss=0.2248, simple_loss=0.3032, pruned_loss=0.07325, over 4273088.03 frames. 
], batch size: 194, lr: 2.60e-03, grad_scale: 16.0 2023-06-28 07:21:00,666 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=2003676.0, ans=0.125 2023-06-28 07:22:35,930 INFO [train.py:996] (3/4) Epoch 11, batch 29050, loss[loss=0.2158, simple_loss=0.2915, pruned_loss=0.07001, over 21882.00 frames. ], tot_loss[loss=0.2246, simple_loss=0.3014, pruned_loss=0.07386, over 4281400.27 frames. ], batch size: 332, lr: 2.60e-03, grad_scale: 16.0 2023-06-28 07:22:41,288 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2003976.0, ans=0.1 2023-06-28 07:23:03,062 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.28 vs. limit=10.0 2023-06-28 07:23:26,123 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2004096.0, ans=0.1 2023-06-28 07:23:58,406 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=2004156.0, ans=0.125 2023-06-28 07:24:04,795 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.241e+02 7.722e+02 1.077e+03 1.560e+03 2.970e+03, threshold=2.155e+03, percent-clipped=7.0 2023-06-28 07:24:18,317 INFO [train.py:996] (3/4) Epoch 11, batch 29100, loss[loss=0.1877, simple_loss=0.2578, pruned_loss=0.05884, over 21569.00 frames. ], tot_loss[loss=0.2177, simple_loss=0.2931, pruned_loss=0.07115, over 4276583.95 frames. ], batch size: 391, lr: 2.60e-03, grad_scale: 16.0 2023-06-28 07:24:23,927 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=2004276.0, ans=0.2 2023-06-28 07:24:37,129 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=2004276.0, ans=0.125 2023-06-28 07:25:06,585 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=2004396.0, ans=0.125 2023-06-28 07:25:38,788 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=2004456.0, ans=0.035 2023-06-28 07:25:59,494 INFO [train.py:996] (3/4) Epoch 11, batch 29150, loss[loss=0.2097, simple_loss=0.296, pruned_loss=0.06168, over 20818.00 frames. ], tot_loss[loss=0.2152, simple_loss=0.2918, pruned_loss=0.06936, over 4276393.29 frames. ], batch size: 607, lr: 2.60e-03, grad_scale: 16.0 2023-06-28 07:26:10,003 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=2004576.0, ans=0.125 2023-06-28 07:27:20,972 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=2004756.0, ans=0.125 2023-06-28 07:27:26,706 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.585e+02 7.263e+02 1.061e+03 1.748e+03 3.304e+03, threshold=2.122e+03, percent-clipped=12.0 2023-06-28 07:27:39,639 INFO [train.py:996] (3/4) Epoch 11, batch 29200, loss[loss=0.1772, simple_loss=0.2423, pruned_loss=0.05605, over 20731.00 frames. ], tot_loss[loss=0.2117, simple_loss=0.2869, pruned_loss=0.06826, over 4275265.59 frames. 
], batch size: 608, lr: 2.60e-03, grad_scale: 32.0 2023-06-28 07:28:24,427 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=2004996.0, ans=0.125 2023-06-28 07:28:33,382 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=8.41 vs. limit=15.0 2023-06-28 07:28:50,616 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2005056.0, ans=0.1 2023-06-28 07:29:17,755 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.08 vs. limit=10.0 2023-06-28 07:29:26,297 INFO [train.py:996] (3/4) Epoch 11, batch 29250, loss[loss=0.1773, simple_loss=0.2541, pruned_loss=0.05027, over 15853.00 frames. ], tot_loss[loss=0.2086, simple_loss=0.285, pruned_loss=0.0661, over 4267278.48 frames. ], batch size: 60, lr: 2.60e-03, grad_scale: 16.0 2023-06-28 07:29:48,938 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2005236.0, ans=0.1 2023-06-28 07:30:49,526 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=2005416.0, ans=0.0 2023-06-28 07:30:52,288 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.090e+02 7.920e+02 1.206e+03 1.772e+03 3.423e+03, threshold=2.413e+03, percent-clipped=14.0 2023-06-28 07:30:54,427 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=2005416.0, ans=0.0 2023-06-28 07:31:08,130 INFO [train.py:996] (3/4) Epoch 11, batch 29300, loss[loss=0.1899, simple_loss=0.2606, pruned_loss=0.05955, over 21681.00 frames. ], tot_loss[loss=0.2101, simple_loss=0.2879, pruned_loss=0.06611, over 4269175.47 frames. ], batch size: 112, lr: 2.60e-03, grad_scale: 16.0 2023-06-28 07:31:36,783 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=2005536.0, ans=0.125 2023-06-28 07:31:58,182 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=2005596.0, ans=0.125 2023-06-28 07:32:13,548 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2005656.0, ans=0.1 2023-06-28 07:32:15,181 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=2005656.0, ans=0.125 2023-06-28 07:32:46,274 INFO [train.py:996] (3/4) Epoch 11, batch 29350, loss[loss=0.1922, simple_loss=0.2777, pruned_loss=0.05331, over 21631.00 frames. ], tot_loss[loss=0.2073, simple_loss=0.2837, pruned_loss=0.06545, over 4276202.55 frames. 
], batch size: 263, lr: 2.60e-03, grad_scale: 16.0 2023-06-28 07:33:32,379 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=2005896.0, ans=0.0 2023-06-28 07:33:38,770 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=2005896.0, ans=0.0 2023-06-28 07:34:18,449 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.524e+02 6.284e+02 9.416e+02 1.465e+03 2.688e+03, threshold=1.883e+03, percent-clipped=1.0 2023-06-28 07:34:19,308 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=2006016.0, ans=0.2 2023-06-28 07:34:30,000 INFO [train.py:996] (3/4) Epoch 11, batch 29400, loss[loss=0.1768, simple_loss=0.2605, pruned_loss=0.04653, over 21707.00 frames. ], tot_loss[loss=0.2051, simple_loss=0.2838, pruned_loss=0.06323, over 4278384.81 frames. ], batch size: 298, lr: 2.60e-03, grad_scale: 16.0 2023-06-28 07:34:48,355 INFO [scaling.py:962] (3/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=6.43 vs. limit=8.0 2023-06-28 07:35:01,580 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=2006136.0, ans=0.125 2023-06-28 07:35:33,879 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=13.54 vs. limit=22.5 2023-06-28 07:35:41,745 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=2006256.0, ans=0.2 2023-06-28 07:35:46,970 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=2006256.0, ans=0.125 2023-06-28 07:36:13,427 INFO [train.py:996] (3/4) Epoch 11, batch 29450, loss[loss=0.2491, simple_loss=0.3252, pruned_loss=0.0865, over 21586.00 frames. ], tot_loss[loss=0.2027, simple_loss=0.2803, pruned_loss=0.0625, over 4273707.40 frames. ], batch size: 414, lr: 2.60e-03, grad_scale: 16.0 2023-06-28 07:36:17,534 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=2006376.0, ans=0.2 2023-06-28 07:36:27,414 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=2006376.0, ans=0.2 2023-06-28 07:36:29,411 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=2006436.0, ans=0.0 2023-06-28 07:37:17,939 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=2006496.0, ans=0.125 2023-06-28 07:37:42,872 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=2006616.0, ans=0.125 2023-06-28 07:37:43,662 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.711e+02 7.454e+02 1.204e+03 1.829e+03 3.653e+03, threshold=2.407e+03, percent-clipped=22.0 2023-06-28 07:37:54,984 INFO [train.py:996] (3/4) Epoch 11, batch 29500, loss[loss=0.2005, simple_loss=0.2755, pruned_loss=0.06276, over 21562.00 frames. ], tot_loss[loss=0.2087, simple_loss=0.2861, pruned_loss=0.06566, over 4276621.57 frames. 
], batch size: 548, lr: 2.60e-03, grad_scale: 16.0 2023-06-28 07:38:30,170 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.39 vs. limit=15.0 2023-06-28 07:38:58,631 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=2006796.0, ans=0.0 2023-06-28 07:39:00,213 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=2006796.0, ans=0.2 2023-06-28 07:39:05,271 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=2006856.0, ans=0.0 2023-06-28 07:39:10,227 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=2006856.0, ans=0.125 2023-06-28 07:39:36,803 INFO [train.py:996] (3/4) Epoch 11, batch 29550, loss[loss=0.2346, simple_loss=0.2967, pruned_loss=0.0862, over 21729.00 frames. ], tot_loss[loss=0.2113, simple_loss=0.2877, pruned_loss=0.06743, over 4280931.06 frames. ], batch size: 473, lr: 2.59e-03, grad_scale: 16.0 2023-06-28 07:41:08,575 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.953e+02 8.227e+02 1.182e+03 1.842e+03 6.634e+03, threshold=2.364e+03, percent-clipped=14.0 2023-06-28 07:41:19,892 INFO [train.py:996] (3/4) Epoch 11, batch 29600, loss[loss=0.2101, simple_loss=0.298, pruned_loss=0.0611, over 21351.00 frames. ], tot_loss[loss=0.2172, simple_loss=0.2944, pruned_loss=0.07004, over 4282928.80 frames. ], batch size: 131, lr: 2.59e-03, grad_scale: 32.0 2023-06-28 07:41:25,859 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2007276.0, ans=0.1 2023-06-28 07:41:35,761 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2007276.0, ans=0.1 2023-06-28 07:42:43,025 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=2007516.0, ans=0.2 2023-06-28 07:42:48,042 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2007516.0, ans=0.1 2023-06-28 07:42:57,471 INFO [train.py:996] (3/4) Epoch 11, batch 29650, loss[loss=0.1788, simple_loss=0.2493, pruned_loss=0.05412, over 21302.00 frames. ], tot_loss[loss=0.2144, simple_loss=0.2937, pruned_loss=0.06757, over 4279380.50 frames. ], batch size: 159, lr: 2.59e-03, grad_scale: 16.0 2023-06-28 07:43:17,702 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=2007636.0, ans=0.95 2023-06-28 07:43:27,314 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=2007636.0, ans=10.0 2023-06-28 07:43:47,463 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=2007696.0, ans=0.04949747468305833 2023-06-28 07:44:26,232 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.283e+02 7.090e+02 1.074e+03 1.668e+03 4.986e+03, threshold=2.147e+03, percent-clipped=16.0 2023-06-28 07:44:40,973 INFO [train.py:996] (3/4) Epoch 11, batch 29700, loss[loss=0.2992, simple_loss=0.393, pruned_loss=0.1027, over 21550.00 frames. 
], tot_loss[loss=0.215, simple_loss=0.2949, pruned_loss=0.06757, over 4282574.07 frames. ], batch size: 471, lr: 2.59e-03, grad_scale: 16.0 2023-06-28 07:44:55,233 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=8.04 vs. limit=10.0 2023-06-28 07:45:30,742 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=2007996.0, ans=0.0 2023-06-28 07:46:08,687 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.max_abs, batch_count=2008116.0, ans=10.0 2023-06-28 07:46:22,813 INFO [train.py:996] (3/4) Epoch 11, batch 29750, loss[loss=0.1807, simple_loss=0.2818, pruned_loss=0.0398, over 19853.00 frames. ], tot_loss[loss=0.2173, simple_loss=0.2994, pruned_loss=0.06762, over 4285175.53 frames. ], batch size: 702, lr: 2.59e-03, grad_scale: 16.0 2023-06-28 07:46:23,258 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=2008176.0, ans=0.125 2023-06-28 07:47:13,272 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=2008296.0, ans=0.0 2023-06-28 07:47:13,324 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=2008296.0, ans=0.125 2023-06-28 07:47:14,733 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=2008296.0, ans=0.0 2023-06-28 07:47:46,261 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=2008416.0, ans=0.125 2023-06-28 07:47:49,323 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.946e+02 7.116e+02 1.082e+03 1.518e+03 2.580e+03, threshold=2.164e+03, percent-clipped=5.0 2023-06-28 07:48:07,924 INFO [train.py:996] (3/4) Epoch 11, batch 29800, loss[loss=0.2125, simple_loss=0.3093, pruned_loss=0.05789, over 17342.00 frames. ], tot_loss[loss=0.2179, simple_loss=0.2996, pruned_loss=0.06813, over 4284342.98 frames. ], batch size: 60, lr: 2.59e-03, grad_scale: 16.0 2023-06-28 07:48:12,137 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=2008476.0, ans=0.125 2023-06-28 07:48:21,632 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=2008476.0, ans=0.125 2023-06-28 07:49:43,440 INFO [train.py:996] (3/4) Epoch 11, batch 29850, loss[loss=0.1926, simple_loss=0.2746, pruned_loss=0.05531, over 21868.00 frames. ], tot_loss[loss=0.214, simple_loss=0.2951, pruned_loss=0.06648, over 4284192.03 frames. ], batch size: 333, lr: 2.59e-03, grad_scale: 16.0 2023-06-28 07:50:16,468 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=2008836.0, ans=0.0 2023-06-28 07:50:42,289 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=2008896.0, ans=0.125 2023-06-28 07:50:42,785 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.05 vs. 
limit=22.5 2023-06-28 07:51:09,753 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.656e+02 6.362e+02 8.665e+02 1.424e+03 2.891e+03, threshold=1.733e+03, percent-clipped=5.0 2023-06-28 07:51:12,056 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=2009016.0, ans=0.125 2023-06-28 07:51:28,492 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer_ff3.min_abs, batch_count=2009076.0, ans=0.2 2023-06-28 07:51:29,359 INFO [train.py:996] (3/4) Epoch 11, batch 29900, loss[loss=0.2243, simple_loss=0.2991, pruned_loss=0.07478, over 21651.00 frames. ], tot_loss[loss=0.2141, simple_loss=0.2935, pruned_loss=0.06738, over 4292385.26 frames. ], batch size: 263, lr: 2.59e-03, grad_scale: 16.0 2023-06-28 07:51:59,227 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=2009136.0, ans=10.0 2023-06-28 07:52:22,497 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=2009196.0, ans=0.0 2023-06-28 07:52:32,920 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=2009256.0, ans=0.125 2023-06-28 07:52:36,076 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=2009256.0, ans=0.125 2023-06-28 07:52:48,391 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.20 vs. limit=15.0 2023-06-28 07:52:57,131 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=2009316.0, ans=0.0 2023-06-28 07:53:11,883 INFO [train.py:996] (3/4) Epoch 11, batch 29950, loss[loss=0.2535, simple_loss=0.3254, pruned_loss=0.09075, over 21552.00 frames. ], tot_loss[loss=0.2192, simple_loss=0.2974, pruned_loss=0.07051, over 4292003.24 frames. ], batch size: 415, lr: 2.59e-03, grad_scale: 16.0 2023-06-28 07:54:50,050 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.989e+02 7.621e+02 1.240e+03 1.706e+03 3.587e+03, threshold=2.479e+03, percent-clipped=22.0 2023-06-28 07:55:04,724 INFO [train.py:996] (3/4) Epoch 11, batch 30000, loss[loss=0.1887, simple_loss=0.2782, pruned_loss=0.04954, over 21320.00 frames. ], tot_loss[loss=0.2199, simple_loss=0.2988, pruned_loss=0.07047, over 4294185.46 frames. ], batch size: 176, lr: 2.59e-03, grad_scale: 32.0 2023-06-28 07:55:04,724 INFO [train.py:1019] (3/4) Computing validation loss 2023-06-28 07:55:19,992 INFO [zipformer.py:1728] (3/4) name=encoder.encoders.2.encoder.layers.1.self_attn_weights, attn_weights_entropy = tensor([1.7078, 3.2587, 3.1479, 3.2918], device='cuda:3') 2023-06-28 07:55:21,706 INFO [train.py:1028] (3/4) Epoch 11, validation: loss=0.2519, simple_loss=0.3444, pruned_loss=0.07975, over 1796401.00 frames. 2023-06-28 07:55:21,707 INFO [train.py:1029] (3/4) Maximum memory allocated so far is 23690MB 2023-06-28 07:55:34,087 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=8.49 vs. limit=15.0 2023-06-28 07:56:17,769 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.54 vs. 
limit=15.0 2023-06-28 07:56:29,091 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=2009856.0, ans=0.125 2023-06-28 07:56:41,489 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=2009856.0, ans=0.0 2023-06-28 07:57:10,910 INFO [train.py:996] (3/4) Epoch 11, batch 30050, loss[loss=0.2428, simple_loss=0.3736, pruned_loss=0.05601, over 20752.00 frames. ], tot_loss[loss=0.2191, simple_loss=0.302, pruned_loss=0.06811, over 4288562.67 frames. ], batch size: 607, lr: 2.59e-03, grad_scale: 32.0 2023-06-28 07:57:13,171 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=2009976.0, ans=0.125 2023-06-28 07:58:37,238 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.min_abs, batch_count=2010216.0, ans=0.5 2023-06-28 07:58:44,578 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.592e+02 6.904e+02 1.436e+03 2.249e+03 4.425e+03, threshold=2.873e+03, percent-clipped=20.0 2023-06-28 07:58:53,200 INFO [train.py:996] (3/4) Epoch 11, batch 30100, loss[loss=0.1891, simple_loss=0.2582, pruned_loss=0.06, over 21603.00 frames. ], tot_loss[loss=0.2181, simple_loss=0.3007, pruned_loss=0.06778, over 4285035.87 frames. ], batch size: 332, lr: 2.59e-03, grad_scale: 16.0 2023-06-28 07:58:56,928 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=2010276.0, ans=0.2 2023-06-28 07:59:26,687 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.91 vs. limit=15.0 2023-06-28 07:59:54,138 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=2010396.0, ans=0.125 2023-06-28 08:00:06,188 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=2010456.0, ans=0.2 2023-06-28 08:00:21,741 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=2010516.0, ans=0.125 2023-06-28 08:00:21,750 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=2010516.0, ans=0.0 2023-06-28 08:00:36,680 INFO [train.py:996] (3/4) Epoch 11, batch 30150, loss[loss=0.2064, simple_loss=0.2631, pruned_loss=0.07487, over 20224.00 frames. ], tot_loss[loss=0.2179, simple_loss=0.2967, pruned_loss=0.06953, over 4284027.68 frames. 
], batch size: 702, lr: 2.59e-03, grad_scale: 16.0 2023-06-28 08:00:59,046 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2010576.0, ans=0.1 2023-06-28 08:01:25,517 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2010636.0, ans=0.1 2023-06-28 08:01:34,712 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=2010696.0, ans=0.125 2023-06-28 08:01:39,680 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-28 08:02:07,485 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=2010816.0, ans=0.125 2023-06-28 08:02:15,844 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2010816.0, ans=0.1 2023-06-28 08:02:18,805 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.690e+02 6.702e+02 9.063e+02 1.523e+03 3.175e+03, threshold=1.813e+03, percent-clipped=2.0 2023-06-28 08:02:29,065 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2010816.0, ans=0.1 2023-06-28 08:02:36,688 INFO [train.py:996] (3/4) Epoch 11, batch 30200, loss[loss=0.212, simple_loss=0.3287, pruned_loss=0.04765, over 21226.00 frames. ], tot_loss[loss=0.2169, simple_loss=0.2985, pruned_loss=0.06765, over 4272792.72 frames. ], batch size: 549, lr: 2.59e-03, grad_scale: 16.0 2023-06-28 08:02:53,253 INFO [scaling.py:962] (3/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=5.91 vs. limit=8.0 2023-06-28 08:03:16,184 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.33 vs. limit=6.0 2023-06-28 08:03:17,459 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=2010996.0, ans=0.025 2023-06-28 08:03:26,088 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=2010996.0, ans=0.0 2023-06-28 08:04:21,931 INFO [train.py:996] (3/4) Epoch 11, batch 30250, loss[loss=0.2271, simple_loss=0.3329, pruned_loss=0.06067, over 21645.00 frames. ], tot_loss[loss=0.2214, simple_loss=0.3046, pruned_loss=0.06908, over 4269058.29 frames. ], batch size: 230, lr: 2.59e-03, grad_scale: 16.0 2023-06-28 08:04:55,519 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=4.54 vs. limit=15.0 2023-06-28 08:05:09,652 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=2011296.0, ans=0.2 2023-06-28 08:05:10,195 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.28 vs. 
limit=22.5 2023-06-28 08:05:55,955 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.393e+02 7.352e+02 1.154e+03 1.714e+03 3.720e+03, threshold=2.308e+03, percent-clipped=21.0 2023-06-28 08:06:00,150 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=2011416.0, ans=0.0 2023-06-28 08:06:04,342 INFO [train.py:996] (3/4) Epoch 11, batch 30300, loss[loss=0.1963, simple_loss=0.2608, pruned_loss=0.06591, over 21795.00 frames. ], tot_loss[loss=0.2195, simple_loss=0.3008, pruned_loss=0.06905, over 4269765.72 frames. ], batch size: 352, lr: 2.59e-03, grad_scale: 16.0 2023-06-28 08:06:10,868 INFO [scaling.py:962] (3/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=3.79 vs. limit=5.0 2023-06-28 08:06:22,227 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=2011536.0, ans=0.04949747468305833 2023-06-28 08:07:50,276 INFO [train.py:996] (3/4) Epoch 11, batch 30350, loss[loss=0.3069, simple_loss=0.3941, pruned_loss=0.1099, over 21443.00 frames. ], tot_loss[loss=0.2211, simple_loss=0.3013, pruned_loss=0.07051, over 4267683.38 frames. ], batch size: 471, lr: 2.59e-03, grad_scale: 16.0 2023-06-28 08:08:02,911 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=15.94 vs. limit=22.5 2023-06-28 08:08:05,033 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=2011836.0, ans=0.125 2023-06-28 08:08:51,216 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2012016.0, ans=0.1 2023-06-28 08:08:57,087 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.778e+02 8.957e+02 1.374e+03 2.295e+03 4.777e+03, threshold=2.749e+03, percent-clipped=24.0 2023-06-28 08:09:10,764 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=2012076.0, ans=0.2 2023-06-28 08:09:11,762 INFO [train.py:996] (3/4) Epoch 11, batch 30400, loss[loss=0.1989, simple_loss=0.2508, pruned_loss=0.07349, over 20391.00 frames. ], tot_loss[loss=0.218, simple_loss=0.2969, pruned_loss=0.06959, over 4259870.72 frames. ], batch size: 703, lr: 2.59e-03, grad_scale: 32.0 2023-06-28 08:09:34,428 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=2012136.0, ans=0.0 2023-06-28 08:10:08,849 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.29 vs. limit=6.0 2023-06-28 08:10:18,679 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.93 vs. limit=6.0 2023-06-28 08:10:30,028 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2012316.0, ans=0.1 2023-06-28 08:10:34,679 INFO [train.py:996] (3/4) Epoch 11, batch 30450, loss[loss=0.2616, simple_loss=0.3828, pruned_loss=0.07015, over 19880.00 frames. ], tot_loss[loss=0.218, simple_loss=0.2978, pruned_loss=0.0691, over 4200714.24 frames. 
], batch size: 702, lr: 2.59e-03, grad_scale: 8.0 2023-06-28 08:10:51,623 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-28 08:10:52,899 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=2012436.0, ans=0.125 2023-06-28 08:13:53,271 INFO [train.py:996] (3/4) Epoch 12, batch 0, loss[loss=0.2077, simple_loss=0.2836, pruned_loss=0.06587, over 21807.00 frames. ], tot_loss[loss=0.2077, simple_loss=0.2836, pruned_loss=0.06587, over 21807.00 frames. ], batch size: 102, lr: 2.47e-03, grad_scale: 16.0 2023-06-28 08:13:53,272 INFO [train.py:1019] (3/4) Computing validation loss 2023-06-28 08:14:03,523 INFO [zipformer.py:1728] (3/4) name=encoder.encoders.5.encoder.layers.1.self_attn_weights, attn_weights_entropy = tensor([2.4428, 2.2785, 4.3565, 4.0517], device='cuda:3') 2023-06-28 08:14:09,656 INFO [train.py:1028] (3/4) Epoch 12, validation: loss=0.2477, simple_loss=0.3485, pruned_loss=0.0734, over 1796401.00 frames. 2023-06-28 08:14:09,657 INFO [train.py:1029] (3/4) Maximum memory allocated so far is 23690MB 2023-06-28 08:14:12,903 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 6.161e+02 1.803e+03 3.374e+03 5.381e+03 1.358e+04, threshold=6.748e+03, percent-clipped=56.0 2023-06-28 08:14:52,276 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=2012766.0, ans=0.125 2023-06-28 08:15:36,420 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=9.27 vs. limit=15.0 2023-06-28 08:15:54,160 INFO [train.py:996] (3/4) Epoch 12, batch 50, loss[loss=0.2544, simple_loss=0.3614, pruned_loss=0.07373, over 21641.00 frames. ], tot_loss[loss=0.221, simple_loss=0.3037, pruned_loss=0.06913, over 956001.58 frames. ], batch size: 414, lr: 2.47e-03, grad_scale: 16.0 2023-06-28 08:15:57,056 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten.whitening_limit, batch_count=2012946.0, ans=15.0 2023-06-28 08:17:18,079 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=2013126.0, ans=0.125 2023-06-28 08:17:23,039 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2013186.0, ans=0.1 2023-06-28 08:17:37,276 INFO [train.py:996] (3/4) Epoch 12, batch 100, loss[loss=0.2288, simple_loss=0.3207, pruned_loss=0.06849, over 21867.00 frames. ], tot_loss[loss=0.2275, simple_loss=0.3167, pruned_loss=0.06914, over 1687533.77 frames. ], batch size: 316, lr: 2.47e-03, grad_scale: 16.0 2023-06-28 08:17:40,463 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.118e+02 6.672e+02 9.899e+02 1.706e+03 3.699e+03, threshold=1.980e+03, percent-clipped=0.0 2023-06-28 08:18:02,629 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=2013306.0, ans=0.0 2023-06-28 08:18:21,047 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.41 vs. 
limit=15.0 2023-06-28 08:18:48,396 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-28 08:18:56,104 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=2013426.0, ans=0.0 2023-06-28 08:19:04,495 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=2013486.0, ans=0.125 2023-06-28 08:19:18,617 INFO [train.py:996] (3/4) Epoch 12, batch 150, loss[loss=0.2112, simple_loss=0.3047, pruned_loss=0.05881, over 21551.00 frames. ], tot_loss[loss=0.2283, simple_loss=0.3177, pruned_loss=0.06946, over 2263330.87 frames. ], batch size: 230, lr: 2.47e-03, grad_scale: 16.0 2023-06-28 08:19:36,024 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=2013606.0, ans=0.2 2023-06-28 08:20:16,683 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=2013666.0, ans=0.2 2023-06-28 08:20:45,207 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=2013786.0, ans=0.0 2023-06-28 08:20:57,641 INFO [train.py:996] (3/4) Epoch 12, batch 200, loss[loss=0.1982, simple_loss=0.2715, pruned_loss=0.06249, over 21118.00 frames. ], tot_loss[loss=0.2277, simple_loss=0.3152, pruned_loss=0.07013, over 2696540.41 frames. ], batch size: 143, lr: 2.47e-03, grad_scale: 16.0 2023-06-28 08:21:00,957 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.147e+02 7.904e+02 1.199e+03 1.656e+03 3.803e+03, threshold=2.398e+03, percent-clipped=21.0 2023-06-28 08:21:31,359 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.42 vs. limit=15.0 2023-06-28 08:22:19,586 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=2014026.0, ans=0.95 2023-06-28 08:22:23,314 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.41 vs. limit=15.0 2023-06-28 08:22:39,354 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=2014086.0, ans=0.125 2023-06-28 08:22:42,025 INFO [train.py:996] (3/4) Epoch 12, batch 250, loss[loss=0.183, simple_loss=0.2572, pruned_loss=0.0544, over 21796.00 frames. ], tot_loss[loss=0.2246, simple_loss=0.3098, pruned_loss=0.06972, over 3052070.46 frames. ], batch size: 112, lr: 2.47e-03, grad_scale: 16.0 2023-06-28 08:23:23,419 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=2014206.0, ans=0.2 2023-06-28 08:24:12,241 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=9.84 vs. limit=15.0 2023-06-28 08:24:32,073 INFO [train.py:996] (3/4) Epoch 12, batch 300, loss[loss=0.1956, simple_loss=0.2672, pruned_loss=0.06197, over 21888.00 frames. ], tot_loss[loss=0.2206, simple_loss=0.3029, pruned_loss=0.06913, over 3335046.09 frames. 
], batch size: 316, lr: 2.47e-03, grad_scale: 16.0 2023-06-28 08:24:35,392 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.745e+02 6.973e+02 9.077e+02 1.413e+03 3.093e+03, threshold=1.815e+03, percent-clipped=6.0 2023-06-28 08:25:09,571 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.50 vs. limit=22.5 2023-06-28 08:25:11,285 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-28 08:26:06,262 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=2014686.0, ans=0.125 2023-06-28 08:26:20,892 INFO [train.py:996] (3/4) Epoch 12, batch 350, loss[loss=0.2393, simple_loss=0.3085, pruned_loss=0.08512, over 21372.00 frames. ], tot_loss[loss=0.2168, simple_loss=0.2963, pruned_loss=0.06869, over 3547117.62 frames. ], batch size: 471, lr: 2.47e-03, grad_scale: 16.0 2023-06-28 08:27:30,525 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=2014866.0, ans=0.0 2023-06-28 08:28:07,201 INFO [train.py:996] (3/4) Epoch 12, batch 400, loss[loss=0.1788, simple_loss=0.2384, pruned_loss=0.05964, over 21277.00 frames. ], tot_loss[loss=0.2123, simple_loss=0.29, pruned_loss=0.06727, over 3714863.58 frames. ], batch size: 551, lr: 2.47e-03, grad_scale: 32.0 2023-06-28 08:28:10,666 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.793e+02 7.672e+02 1.106e+03 1.472e+03 3.614e+03, threshold=2.212e+03, percent-clipped=11.0 2023-06-28 08:28:13,263 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2015046.0, ans=0.1 2023-06-28 08:28:13,334 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=2015046.0, ans=0.125 2023-06-28 08:28:14,894 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=2015046.0, ans=0.0 2023-06-28 08:28:23,698 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=12.29 vs. limit=15.0 2023-06-28 08:28:52,104 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=2015166.0, ans=0.125 2023-06-28 08:29:30,752 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=9.89 vs. limit=15.0 2023-06-28 08:29:53,471 INFO [train.py:996] (3/4) Epoch 12, batch 450, loss[loss=0.1772, simple_loss=0.2759, pruned_loss=0.0393, over 21676.00 frames. ], tot_loss[loss=0.2086, simple_loss=0.2861, pruned_loss=0.06551, over 3841197.10 frames. ], batch size: 247, lr: 2.47e-03, grad_scale: 16.0 2023-06-28 08:29:57,739 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=2015346.0, ans=0.5 2023-06-28 08:30:22,043 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.52 vs. 
limit=15.0 2023-06-28 08:30:37,930 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=2015466.0, ans=0.0 2023-06-28 08:31:01,873 INFO [scaling.py:962] (3/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=4.35 vs. limit=5.0 2023-06-28 08:31:37,497 INFO [train.py:996] (3/4) Epoch 12, batch 500, loss[loss=0.1835, simple_loss=0.2537, pruned_loss=0.05665, over 21766.00 frames. ], tot_loss[loss=0.2072, simple_loss=0.2865, pruned_loss=0.06392, over 3934146.00 frames. ], batch size: 317, lr: 2.47e-03, grad_scale: 16.0 2023-06-28 08:31:42,487 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.576e+02 9.650e+02 1.378e+03 2.425e+03 6.087e+03, threshold=2.755e+03, percent-clipped=29.0 2023-06-28 08:33:00,590 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2015826.0, ans=0.1 2023-06-28 08:33:02,128 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=2015826.0, ans=0.125 2023-06-28 08:33:05,526 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=2015886.0, ans=0.125 2023-06-28 08:33:22,126 INFO [train.py:996] (3/4) Epoch 12, batch 550, loss[loss=0.2173, simple_loss=0.2947, pruned_loss=0.06996, over 21887.00 frames. ], tot_loss[loss=0.2103, simple_loss=0.2919, pruned_loss=0.06434, over 4006296.02 frames. ], batch size: 118, lr: 2.47e-03, grad_scale: 16.0 2023-06-28 08:34:00,958 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-28 08:34:36,106 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=5.37 vs. limit=15.0 2023-06-28 08:34:47,461 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.61 vs. limit=15.0 2023-06-28 08:35:00,968 INFO [train.py:996] (3/4) Epoch 12, batch 600, loss[loss=0.2925, simple_loss=0.3839, pruned_loss=0.1006, over 21535.00 frames. ], tot_loss[loss=0.2136, simple_loss=0.2968, pruned_loss=0.06526, over 4072776.16 frames. ], batch size: 508, lr: 2.47e-03, grad_scale: 16.0 2023-06-28 08:35:05,855 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.203e+02 8.466e+02 1.434e+03 2.194e+03 5.258e+03, threshold=2.867e+03, percent-clipped=12.0 2023-06-28 08:35:27,085 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=2016306.0, ans=0.125 2023-06-28 08:36:44,607 INFO [train.py:996] (3/4) Epoch 12, batch 650, loss[loss=0.2733, simple_loss=0.3242, pruned_loss=0.1112, over 21770.00 frames. ], tot_loss[loss=0.215, simple_loss=0.2981, pruned_loss=0.06592, over 4109781.52 frames. ], batch size: 508, lr: 2.47e-03, grad_scale: 16.0 2023-06-28 08:37:16,133 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.19 vs. 
limit=15.0 2023-06-28 08:37:50,044 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=2016726.0, ans=0.125 2023-06-28 08:38:11,816 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=2016786.0, ans=0.0 2023-06-28 08:38:11,872 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=2016786.0, ans=0.0 2023-06-28 08:38:19,383 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.15 vs. limit=12.0 2023-06-28 08:38:23,207 INFO [train.py:996] (3/4) Epoch 12, batch 700, loss[loss=0.3572, simple_loss=0.4261, pruned_loss=0.1442, over 21569.00 frames. ], tot_loss[loss=0.2157, simple_loss=0.2973, pruned_loss=0.06703, over 4153070.57 frames. ], batch size: 508, lr: 2.47e-03, grad_scale: 8.0 2023-06-28 08:38:26,955 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=2016846.0, ans=0.0 2023-06-28 08:38:34,693 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.083e+02 8.629e+02 1.370e+03 1.985e+03 4.368e+03, threshold=2.739e+03, percent-clipped=8.0 2023-06-28 08:38:38,719 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-28 08:40:06,458 INFO [train.py:996] (3/4) Epoch 12, batch 750, loss[loss=0.2076, simple_loss=0.337, pruned_loss=0.03915, over 19801.00 frames. ], tot_loss[loss=0.2152, simple_loss=0.2955, pruned_loss=0.06741, over 4180279.90 frames. ], batch size: 703, lr: 2.47e-03, grad_scale: 8.0 2023-06-28 08:40:40,470 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=2017206.0, ans=0.125 2023-06-28 08:41:09,995 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=2017266.0, ans=0.1 2023-06-28 08:41:15,324 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-28 08:41:22,079 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=2017326.0, ans=0.125 2023-06-28 08:41:49,176 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=2017446.0, ans=0.125 2023-06-28 08:41:50,315 INFO [train.py:996] (3/4) Epoch 12, batch 800, loss[loss=0.2038, simple_loss=0.298, pruned_loss=0.05477, over 21706.00 frames. ], tot_loss[loss=0.2145, simple_loss=0.2936, pruned_loss=0.06765, over 4194111.98 frames. 
], batch size: 298, lr: 2.47e-03, grad_scale: 16.0 2023-06-28 08:42:01,927 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.954e+02 9.081e+02 1.260e+03 2.091e+03 4.459e+03, threshold=2.521e+03, percent-clipped=14.0 2023-06-28 08:43:00,209 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=2017626.0, ans=0.125 2023-06-28 08:43:06,578 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2017626.0, ans=0.1 2023-06-28 08:43:13,247 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=2017626.0, ans=0.0 2023-06-28 08:43:15,299 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=2017686.0, ans=0.125 2023-06-28 08:43:33,338 INFO [train.py:996] (3/4) Epoch 12, batch 850, loss[loss=0.1891, simple_loss=0.2804, pruned_loss=0.04894, over 21657.00 frames. ], tot_loss[loss=0.2124, simple_loss=0.2906, pruned_loss=0.06708, over 4214042.97 frames. ], batch size: 247, lr: 2.47e-03, grad_scale: 16.0 2023-06-28 08:44:07,019 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=2017806.0, ans=0.2 2023-06-28 08:44:59,315 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=2017926.0, ans=0.125 2023-06-28 08:45:15,549 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys.whitening_limit, batch_count=2017986.0, ans=6.0 2023-06-28 08:45:24,344 INFO [train.py:996] (3/4) Epoch 12, batch 900, loss[loss=0.1645, simple_loss=0.2418, pruned_loss=0.0436, over 21848.00 frames. ], tot_loss[loss=0.2097, simple_loss=0.2873, pruned_loss=0.06607, over 4236895.65 frames. ], batch size: 118, lr: 2.47e-03, grad_scale: 16.0 2023-06-28 08:45:35,780 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.139e+02 7.978e+02 1.292e+03 1.942e+03 4.093e+03, threshold=2.584e+03, percent-clipped=13.0 2023-06-28 08:45:44,827 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=2018106.0, ans=0.125 2023-06-28 08:45:49,829 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.min_positive, batch_count=2018106.0, ans=0.05 2023-06-28 08:46:01,511 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=2018106.0, ans=0.035 2023-06-28 08:46:27,076 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=2018166.0, ans=0.125 2023-06-28 08:47:11,536 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2018286.0, ans=0.1 2023-06-28 08:47:14,379 INFO [train.py:996] (3/4) Epoch 12, batch 950, loss[loss=0.1883, simple_loss=0.2501, pruned_loss=0.06327, over 21316.00 frames. ], tot_loss[loss=0.2086, simple_loss=0.286, pruned_loss=0.06564, over 4253220.12 frames. 
], batch size: 548, lr: 2.47e-03, grad_scale: 16.0 2023-06-28 08:47:33,222 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=2018346.0, ans=0.125 2023-06-28 08:47:57,932 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=2018466.0, ans=0.125 2023-06-28 08:48:01,319 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=2018466.0, ans=0.125 2023-06-28 08:48:03,062 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=2018466.0, ans=0.0 2023-06-28 08:48:10,493 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=11.82 vs. limit=15.0 2023-06-28 08:48:42,338 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=2018586.0, ans=0.0 2023-06-28 08:48:56,773 INFO [train.py:996] (3/4) Epoch 12, batch 1000, loss[loss=0.1731, simple_loss=0.2471, pruned_loss=0.04953, over 21169.00 frames. ], tot_loss[loss=0.2076, simple_loss=0.2858, pruned_loss=0.06474, over 4262728.07 frames. ], batch size: 143, lr: 2.47e-03, grad_scale: 16.0 2023-06-28 08:49:00,898 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-28 08:49:03,706 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.940e+02 7.062e+02 8.970e+02 1.402e+03 3.868e+03, threshold=1.794e+03, percent-clipped=7.0 2023-06-28 08:49:26,986 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.31 vs. limit=15.0 2023-06-28 08:50:08,919 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=2018826.0, ans=0.2 2023-06-28 08:50:21,049 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=2018886.0, ans=0.125 2023-06-28 08:50:31,633 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=2018886.0, ans=0.0 2023-06-28 08:50:42,111 INFO [train.py:996] (3/4) Epoch 12, batch 1050, loss[loss=0.2189, simple_loss=0.2947, pruned_loss=0.07158, over 21773.00 frames. ], tot_loss[loss=0.2081, simple_loss=0.2857, pruned_loss=0.06526, over 4266498.19 frames. ], batch size: 389, lr: 2.47e-03, grad_scale: 16.0 2023-06-28 08:50:55,782 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=2018946.0, ans=0.1 2023-06-28 08:51:03,044 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=2019006.0, ans=0.125 2023-06-28 08:51:23,056 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=2019006.0, ans=0.0 2023-06-28 08:51:34,946 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=2019066.0, ans=0.0 2023-06-28 08:52:31,733 INFO [train.py:996] (3/4) Epoch 12, batch 1100, loss[loss=0.2396, simple_loss=0.3177, pruned_loss=0.08079, over 21297.00 frames. ], tot_loss[loss=0.2089, simple_loss=0.2874, pruned_loss=0.0652, over 4270706.30 frames. 
], batch size: 159, lr: 2.47e-03, grad_scale: 16.0 2023-06-28 08:52:39,016 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.826e+02 7.483e+02 1.102e+03 1.696e+03 3.574e+03, threshold=2.203e+03, percent-clipped=22.0 2023-06-28 08:53:11,594 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=2019366.0, ans=0.5 2023-06-28 08:53:14,857 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=2019366.0, ans=0.125 2023-06-28 08:53:14,881 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=2019366.0, ans=0.2 2023-06-28 08:53:16,933 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2019366.0, ans=0.1 2023-06-28 08:53:59,150 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=2019486.0, ans=0.0 2023-06-28 08:54:17,190 INFO [train.py:996] (3/4) Epoch 12, batch 1150, loss[loss=0.266, simple_loss=0.3351, pruned_loss=0.09842, over 21589.00 frames. ], tot_loss[loss=0.2079, simple_loss=0.2875, pruned_loss=0.06415, over 4268207.72 frames. ], batch size: 389, lr: 2.47e-03, grad_scale: 16.0 2023-06-28 08:54:59,634 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.96 vs. limit=15.0 2023-06-28 08:55:33,545 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=2019726.0, ans=0.0 2023-06-28 08:55:40,190 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=2019786.0, ans=0.125 2023-06-28 08:56:08,755 INFO [train.py:996] (3/4) Epoch 12, batch 1200, loss[loss=0.2158, simple_loss=0.303, pruned_loss=0.06436, over 21749.00 frames. ], tot_loss[loss=0.2075, simple_loss=0.2876, pruned_loss=0.06369, over 4268542.39 frames. ], batch size: 351, lr: 2.47e-03, grad_scale: 32.0 2023-06-28 08:56:15,504 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.003e+02 8.397e+02 1.494e+03 2.117e+03 4.524e+03, threshold=2.987e+03, percent-clipped=23.0 2023-06-28 08:56:29,285 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2019906.0, ans=0.1 2023-06-28 08:56:31,390 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.58 vs. limit=15.0 2023-06-28 08:56:41,276 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-28 08:56:44,642 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=2019966.0, ans=0.0 2023-06-28 08:56:48,550 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.05 vs. 
limit=15.0 2023-06-28 08:56:49,553 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2019966.0, ans=0.1 2023-06-28 08:56:52,808 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=2019966.0, ans=0.0 2023-06-28 08:56:52,819 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=2019966.0, ans=0.0 2023-06-28 08:57:22,555 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2020026.0, ans=0.1 2023-06-28 08:57:43,437 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.23 vs. limit=15.0 2023-06-28 08:57:49,438 INFO [train.py:996] (3/4) Epoch 12, batch 1250, loss[loss=0.2327, simple_loss=0.3069, pruned_loss=0.07921, over 21386.00 frames. ], tot_loss[loss=0.2115, simple_loss=0.2914, pruned_loss=0.0658, over 4275022.91 frames. ], batch size: 548, lr: 2.47e-03, grad_scale: 16.0 2023-06-28 08:58:11,143 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=2020206.0, ans=0.2 2023-06-28 08:58:57,379 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.05 vs. limit=15.0 2023-06-28 08:59:22,063 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=2020386.0, ans=0.125 2023-06-28 08:59:40,399 INFO [train.py:996] (3/4) Epoch 12, batch 1300, loss[loss=0.2286, simple_loss=0.3046, pruned_loss=0.07632, over 21846.00 frames. ], tot_loss[loss=0.2133, simple_loss=0.2934, pruned_loss=0.06661, over 4285176.82 frames. ], batch size: 332, lr: 2.47e-03, grad_scale: 16.0 2023-06-28 08:59:48,738 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.742e+02 7.744e+02 1.078e+03 1.630e+03 3.241e+03, threshold=2.156e+03, percent-clipped=1.0 2023-06-28 08:59:52,932 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=2020446.0, ans=0.0 2023-06-28 08:59:57,159 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=5.96 vs. limit=15.0 2023-06-28 09:00:14,825 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=2020566.0, ans=0.125 2023-06-28 09:00:49,330 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=8.29 vs. limit=15.0 2023-06-28 09:01:24,351 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=2020746.0, ans=0.125 2023-06-28 09:01:25,435 INFO [train.py:996] (3/4) Epoch 12, batch 1350, loss[loss=0.1987, simple_loss=0.2748, pruned_loss=0.06131, over 21206.00 frames. ], tot_loss[loss=0.2144, simple_loss=0.2943, pruned_loss=0.06724, over 4288716.41 frames. 
], batch size: 159, lr: 2.47e-03, grad_scale: 16.0 2023-06-28 09:01:45,204 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=2020806.0, ans=0.125 2023-06-28 09:02:03,141 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=2020866.0, ans=0.035 2023-06-28 09:03:04,977 INFO [train.py:996] (3/4) Epoch 12, batch 1400, loss[loss=0.2274, simple_loss=0.3037, pruned_loss=0.07554, over 21282.00 frames. ], tot_loss[loss=0.2132, simple_loss=0.2925, pruned_loss=0.06699, over 4292789.58 frames. ], batch size: 548, lr: 2.47e-03, grad_scale: 16.0 2023-06-28 09:03:13,307 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.716e+02 8.874e+02 1.255e+03 1.971e+03 3.857e+03, threshold=2.510e+03, percent-clipped=18.0 2023-06-28 09:03:32,315 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=2021106.0, ans=0.0 2023-06-28 09:03:34,263 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=2021106.0, ans=0.09899494936611666 2023-06-28 09:03:39,442 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=2021166.0, ans=0.0 2023-06-28 09:03:46,688 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.75 vs. limit=15.0 2023-06-28 09:04:50,312 INFO [train.py:996] (3/4) Epoch 12, batch 1450, loss[loss=0.2102, simple_loss=0.3241, pruned_loss=0.04815, over 19815.00 frames. ], tot_loss[loss=0.2129, simple_loss=0.2926, pruned_loss=0.06658, over 4293209.50 frames. ], batch size: 703, lr: 2.47e-03, grad_scale: 16.0 2023-06-28 09:05:08,270 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.56 vs. limit=15.0 2023-06-28 09:05:28,404 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=2021466.0, ans=0.0 2023-06-28 09:06:27,681 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=2021586.0, ans=0.125 2023-06-28 09:06:32,973 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=2021586.0, ans=0.0 2023-06-28 09:06:37,301 INFO [train.py:996] (3/4) Epoch 12, batch 1500, loss[loss=0.2209, simple_loss=0.3022, pruned_loss=0.06975, over 21738.00 frames. ], tot_loss[loss=0.2148, simple_loss=0.2935, pruned_loss=0.06802, over 4293848.91 frames. 
], batch size: 389, lr: 2.47e-03, grad_scale: 8.0 2023-06-28 09:06:47,785 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.815e+02 8.356e+02 1.274e+03 1.855e+03 4.343e+03, threshold=2.548e+03, percent-clipped=12.0 2023-06-28 09:07:01,903 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2021706.0, ans=0.1 2023-06-28 09:07:10,768 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=2021766.0, ans=0.2 2023-06-28 09:07:32,539 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=2021766.0, ans=0.125 2023-06-28 09:07:32,672 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2021766.0, ans=0.1 2023-06-28 09:08:24,458 INFO [train.py:996] (3/4) Epoch 12, batch 1550, loss[loss=0.2039, simple_loss=0.2835, pruned_loss=0.06214, over 21824.00 frames. ], tot_loss[loss=0.2129, simple_loss=0.2917, pruned_loss=0.067, over 4293268.77 frames. ], batch size: 414, lr: 2.47e-03, grad_scale: 8.0 2023-06-28 09:09:11,800 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=2022066.0, ans=0.2 2023-06-28 09:09:14,798 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=2022066.0, ans=0.125 2023-06-28 09:09:32,959 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=2022126.0, ans=0.0 2023-06-28 09:09:53,482 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=2022186.0, ans=0.125 2023-06-28 09:10:03,877 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=2022186.0, ans=0.0 2023-06-28 09:10:09,934 INFO [train.py:996] (3/4) Epoch 12, batch 1600, loss[loss=0.2001, simple_loss=0.2752, pruned_loss=0.06247, over 21669.00 frames. ], tot_loss[loss=0.2093, simple_loss=0.2883, pruned_loss=0.06515, over 4284911.63 frames. ], batch size: 247, lr: 2.47e-03, grad_scale: 16.0 2023-06-28 09:10:20,072 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.882e+02 7.910e+02 1.210e+03 1.920e+03 3.790e+03, threshold=2.419e+03, percent-clipped=9.0 2023-06-28 09:10:25,754 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=2022306.0, ans=0.0 2023-06-28 09:10:45,636 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2022306.0, ans=0.1 2023-06-28 09:11:14,963 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2022366.0, ans=0.1 2023-06-28 09:11:25,430 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=2022426.0, ans=0.125 2023-06-28 09:11:29,284 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=2022426.0, ans=0.125 2023-06-28 09:11:45,371 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.86 vs. 
limit=15.0 2023-06-28 09:11:58,022 INFO [train.py:996] (3/4) Epoch 12, batch 1650, loss[loss=0.2076, simple_loss=0.2963, pruned_loss=0.0594, over 21937.00 frames. ], tot_loss[loss=0.2078, simple_loss=0.2868, pruned_loss=0.06443, over 4289603.78 frames. ], batch size: 317, lr: 2.47e-03, grad_scale: 16.0 2023-06-28 09:12:35,050 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=2022606.0, ans=0.125 2023-06-28 09:13:45,605 INFO [train.py:996] (3/4) Epoch 12, batch 1700, loss[loss=0.2073, simple_loss=0.2881, pruned_loss=0.06323, over 21448.00 frames. ], tot_loss[loss=0.2091, simple_loss=0.2874, pruned_loss=0.06542, over 4283942.38 frames. ], batch size: 194, lr: 2.47e-03, grad_scale: 16.0 2023-06-28 09:13:55,787 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.138e+02 6.859e+02 1.024e+03 1.407e+03 3.205e+03, threshold=2.048e+03, percent-clipped=5.0 2023-06-28 09:14:25,396 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=2022906.0, ans=0.1 2023-06-28 09:14:52,814 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=2023026.0, ans=0.0 2023-06-28 09:15:23,432 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.whiten.whitening_limit, batch_count=2023086.0, ans=15.0 2023-06-28 09:15:32,583 INFO [train.py:996] (3/4) Epoch 12, batch 1750, loss[loss=0.1562, simple_loss=0.2467, pruned_loss=0.03281, over 21764.00 frames. ], tot_loss[loss=0.209, simple_loss=0.2889, pruned_loss=0.06455, over 4285591.44 frames. ], batch size: 282, lr: 2.47e-03, grad_scale: 16.0 2023-06-28 09:16:14,996 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2023206.0, ans=0.1 2023-06-28 09:16:45,368 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=4.59 vs. limit=15.0 2023-06-28 09:16:55,053 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2023326.0, ans=0.1 2023-06-28 09:17:25,794 INFO [train.py:996] (3/4) Epoch 12, batch 1800, loss[loss=0.2055, simple_loss=0.303, pruned_loss=0.05403, over 21710.00 frames. ], tot_loss[loss=0.2055, simple_loss=0.2872, pruned_loss=0.0619, over 4274927.81 frames. ], batch size: 332, lr: 2.47e-03, grad_scale: 16.0 2023-06-28 09:17:46,589 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.284e+02 7.829e+02 1.190e+03 1.910e+03 4.483e+03, threshold=2.381e+03, percent-clipped=19.0 2023-06-28 09:18:14,527 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2023566.0, ans=0.1 2023-06-28 09:18:28,807 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.23 vs. limit=22.5 2023-06-28 09:19:11,645 INFO [train.py:996] (3/4) Epoch 12, batch 1850, loss[loss=0.253, simple_loss=0.3392, pruned_loss=0.08339, over 21522.00 frames. ], tot_loss[loss=0.2056, simple_loss=0.2898, pruned_loss=0.06071, over 4272953.05 frames. 
], batch size: 507, lr: 2.47e-03, grad_scale: 16.0 2023-06-28 09:19:37,962 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=2023806.0, ans=0.0 2023-06-28 09:19:58,980 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.81 vs. limit=15.0 2023-06-28 09:20:59,006 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=2024046.0, ans=0.05 2023-06-28 09:20:59,929 INFO [train.py:996] (3/4) Epoch 12, batch 1900, loss[loss=0.2146, simple_loss=0.2866, pruned_loss=0.07126, over 21808.00 frames. ], tot_loss[loss=0.2074, simple_loss=0.2915, pruned_loss=0.06167, over 4276516.78 frames. ], batch size: 112, lr: 2.47e-03, grad_scale: 8.0 2023-06-28 09:21:14,712 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=9.06 vs. limit=15.0 2023-06-28 09:21:22,274 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.886e+02 8.389e+02 1.357e+03 2.180e+03 3.591e+03, threshold=2.714e+03, percent-clipped=20.0 2023-06-28 09:21:22,922 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=2024046.0, ans=0.125 2023-06-28 09:21:22,981 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2024046.0, ans=0.1 2023-06-28 09:21:27,868 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2024106.0, ans=0.1 2023-06-28 09:21:27,912 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=2024106.0, ans=0.0 2023-06-28 09:22:13,163 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=2024226.0, ans=0.125 2023-06-28 09:22:30,509 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.46 vs. limit=10.0 2023-06-28 09:22:33,955 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.82 vs. limit=15.0 2023-06-28 09:22:40,178 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=2024286.0, ans=0.125 2023-06-28 09:22:53,535 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.21 vs. limit=15.0 2023-06-28 09:22:54,023 INFO [train.py:996] (3/4) Epoch 12, batch 1950, loss[loss=0.1832, simple_loss=0.2489, pruned_loss=0.05875, over 21582.00 frames. ], tot_loss[loss=0.2054, simple_loss=0.288, pruned_loss=0.06144, over 4280650.35 frames. ], batch size: 282, lr: 2.47e-03, grad_scale: 8.0 2023-06-28 09:23:04,943 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=2024346.0, ans=0.125 2023-06-28 09:24:40,570 INFO [train.py:996] (3/4) Epoch 12, batch 2000, loss[loss=0.2332, simple_loss=0.3218, pruned_loss=0.0723, over 20018.00 frames. ], tot_loss[loss=0.2021, simple_loss=0.2835, pruned_loss=0.0603, over 4269761.04 frames. 
], batch size: 702, lr: 2.47e-03, grad_scale: 16.0 2023-06-28 09:24:52,580 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.543e+02 8.090e+02 1.262e+03 2.210e+03 4.405e+03, threshold=2.524e+03, percent-clipped=15.0 2023-06-28 09:26:25,018 INFO [train.py:996] (3/4) Epoch 12, batch 2050, loss[loss=0.2059, simple_loss=0.3011, pruned_loss=0.05538, over 21571.00 frames. ], tot_loss[loss=0.2035, simple_loss=0.2859, pruned_loss=0.06055, over 4271323.38 frames. ], batch size: 230, lr: 2.47e-03, grad_scale: 8.0 2023-06-28 09:26:45,376 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2025006.0, ans=0.1 2023-06-28 09:26:47,560 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.13 vs. limit=12.0 2023-06-28 09:26:55,238 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=2025006.0, ans=0.0 2023-06-28 09:26:58,469 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=2025066.0, ans=0.04949747468305833 2023-06-28 09:28:07,566 INFO [train.py:996] (3/4) Epoch 12, batch 2100, loss[loss=0.2705, simple_loss=0.3474, pruned_loss=0.09686, over 21585.00 frames. ], tot_loss[loss=0.2068, simple_loss=0.2892, pruned_loss=0.0622, over 4266814.54 frames. ], batch size: 414, lr: 2.47e-03, grad_scale: 8.0 2023-06-28 09:28:21,405 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.339e+02 9.933e+02 1.500e+03 2.145e+03 4.437e+03, threshold=3.000e+03, percent-clipped=17.0 2023-06-28 09:28:29,012 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2025306.0, ans=0.1 2023-06-28 09:28:40,654 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2025366.0, ans=0.1 2023-06-28 09:29:00,870 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=2025366.0, ans=0.125 2023-06-28 09:29:52,488 INFO [train.py:996] (3/4) Epoch 12, batch 2150, loss[loss=0.1983, simple_loss=0.2729, pruned_loss=0.06181, over 21674.00 frames. ], tot_loss[loss=0.2075, simple_loss=0.2883, pruned_loss=0.06333, over 4269506.82 frames. ], batch size: 298, lr: 2.47e-03, grad_scale: 8.0 2023-06-28 09:31:37,764 INFO [train.py:996] (3/4) Epoch 12, batch 2200, loss[loss=0.1644, simple_loss=0.2357, pruned_loss=0.04657, over 15889.00 frames. ], tot_loss[loss=0.2093, simple_loss=0.2904, pruned_loss=0.06403, over 4264900.83 frames. 
], batch size: 62, lr: 2.46e-03, grad_scale: 8.0 2023-06-28 09:31:51,396 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.594e+02 7.144e+02 1.049e+03 1.524e+03 3.402e+03, threshold=2.098e+03, percent-clipped=4.0 2023-06-28 09:31:57,165 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=2025906.0, ans=0.0 2023-06-28 09:31:58,694 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=2025906.0, ans=0.0 2023-06-28 09:32:52,348 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=2026026.0, ans=0.125 2023-06-28 09:33:21,818 INFO [train.py:996] (3/4) Epoch 12, batch 2250, loss[loss=0.1943, simple_loss=0.2662, pruned_loss=0.06123, over 21655.00 frames. ], tot_loss[loss=0.207, simple_loss=0.2877, pruned_loss=0.06312, over 4262135.38 frames. ], batch size: 332, lr: 2.46e-03, grad_scale: 8.0 2023-06-28 09:34:00,740 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=2026266.0, ans=0.125 2023-06-28 09:35:00,201 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=2026386.0, ans=0.0 2023-06-28 09:35:01,962 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=2026386.0, ans=0.0 2023-06-28 09:35:06,609 INFO [train.py:996] (3/4) Epoch 12, batch 2300, loss[loss=0.1701, simple_loss=0.2386, pruned_loss=0.05082, over 21410.00 frames. ], tot_loss[loss=0.2041, simple_loss=0.2828, pruned_loss=0.06269, over 4255837.74 frames. ], batch size: 194, lr: 2.46e-03, grad_scale: 8.0 2023-06-28 09:35:07,057 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=2026446.0, ans=0.125 2023-06-28 09:35:13,797 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=2026446.0, ans=0.125 2023-06-28 09:35:20,307 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.470e+02 7.194e+02 1.165e+03 1.936e+03 3.464e+03, threshold=2.331e+03, percent-clipped=21.0 2023-06-28 09:35:46,608 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=2026566.0, ans=0.0 2023-06-28 09:35:48,456 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=2026566.0, ans=0.125 2023-06-28 09:35:54,818 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=2026566.0, ans=0.125 2023-06-28 09:36:39,945 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=2026686.0, ans=0.0 2023-06-28 09:36:53,075 INFO [train.py:996] (3/4) Epoch 12, batch 2350, loss[loss=0.2064, simple_loss=0.2791, pruned_loss=0.06683, over 21737.00 frames. ], tot_loss[loss=0.2036, simple_loss=0.2802, pruned_loss=0.06353, over 4260058.01 frames. 
], batch size: 282, lr: 2.46e-03, grad_scale: 8.0 2023-06-28 09:36:57,417 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=2026746.0, ans=0.125 2023-06-28 09:36:59,346 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=2026746.0, ans=0.125 2023-06-28 09:37:27,547 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=2026806.0, ans=0.125 2023-06-28 09:37:27,559 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=2026806.0, ans=0.125 2023-06-28 09:38:18,742 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=2026926.0, ans=0.125 2023-06-28 09:38:25,462 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=2026986.0, ans=0.125 2023-06-28 09:38:38,340 INFO [train.py:996] (3/4) Epoch 12, batch 2400, loss[loss=0.22, simple_loss=0.2977, pruned_loss=0.0712, over 21298.00 frames. ], tot_loss[loss=0.2078, simple_loss=0.2843, pruned_loss=0.06571, over 4267406.49 frames. ], batch size: 159, lr: 2.46e-03, grad_scale: 16.0 2023-06-28 09:38:57,282 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.929e+02 7.615e+02 1.092e+03 1.757e+03 3.744e+03, threshold=2.185e+03, percent-clipped=12.0 2023-06-28 09:39:03,143 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=2027106.0, ans=0.0 2023-06-28 09:39:42,514 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=2027166.0, ans=0.0 2023-06-28 09:40:05,965 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=2027286.0, ans=0.2 2023-06-28 09:40:24,040 INFO [train.py:996] (3/4) Epoch 12, batch 2450, loss[loss=0.2084, simple_loss=0.2911, pruned_loss=0.06279, over 21882.00 frames. ], tot_loss[loss=0.2121, simple_loss=0.2897, pruned_loss=0.06728, over 4269637.57 frames. ], batch size: 317, lr: 2.46e-03, grad_scale: 16.0 2023-06-28 09:40:46,874 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=2027406.0, ans=0.125 2023-06-28 09:40:51,809 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2027406.0, ans=0.1 2023-06-28 09:41:26,535 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.86 vs. limit=12.0 2023-06-28 09:41:39,151 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=2027526.0, ans=10.0 2023-06-28 09:41:43,095 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=5.19 vs. 
limit=15.0 2023-06-28 09:41:45,722 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=2027526.0, ans=0.0 2023-06-28 09:41:56,329 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=2027586.0, ans=0.125 2023-06-28 09:42:08,965 INFO [train.py:996] (3/4) Epoch 12, batch 2500, loss[loss=0.1828, simple_loss=0.2523, pruned_loss=0.05664, over 21257.00 frames. ], tot_loss[loss=0.21, simple_loss=0.2871, pruned_loss=0.06643, over 4276701.70 frames. ], batch size: 176, lr: 2.46e-03, grad_scale: 16.0 2023-06-28 09:42:27,033 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.946e+02 7.873e+02 1.330e+03 1.943e+03 4.895e+03, threshold=2.659e+03, percent-clipped=18.0 2023-06-28 09:42:37,670 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=2027706.0, ans=0.125 2023-06-28 09:42:41,357 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=2027706.0, ans=0.125 2023-06-28 09:42:44,929 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.57 vs. limit=15.0 2023-06-28 09:43:06,616 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=2027766.0, ans=0.04949747468305833 2023-06-28 09:43:39,470 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=8.33 vs. limit=15.0 2023-06-28 09:43:52,332 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.max_abs, batch_count=2027946.0, ans=10.0 2023-06-28 09:43:53,476 INFO [train.py:996] (3/4) Epoch 12, batch 2550, loss[loss=0.2153, simple_loss=0.2855, pruned_loss=0.07261, over 21570.00 frames. ], tot_loss[loss=0.2086, simple_loss=0.2857, pruned_loss=0.06578, over 4275519.59 frames. ], batch size: 391, lr: 2.46e-03, grad_scale: 16.0 2023-06-28 09:44:29,139 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2028006.0, ans=0.1 2023-06-28 09:45:37,007 INFO [train.py:996] (3/4) Epoch 12, batch 2600, loss[loss=0.2225, simple_loss=0.3426, pruned_loss=0.05116, over 19780.00 frames. ], tot_loss[loss=0.2109, simple_loss=0.2878, pruned_loss=0.06699, over 4265444.79 frames. ], batch size: 703, lr: 2.46e-03, grad_scale: 16.0 2023-06-28 09:45:37,452 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-28 09:45:55,190 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=5.36 vs. 
limit=12.0 2023-06-28 09:45:55,675 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.364e+02 1.004e+03 1.411e+03 2.308e+03 3.873e+03, threshold=2.822e+03, percent-clipped=11.0 2023-06-28 09:45:56,305 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=2028246.0, ans=0.125 2023-06-28 09:47:02,102 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2028426.0, ans=0.1 2023-06-28 09:47:21,870 INFO [train.py:996] (3/4) Epoch 12, batch 2650, loss[loss=0.2013, simple_loss=0.2936, pruned_loss=0.05454, over 21852.00 frames. ], tot_loss[loss=0.2117, simple_loss=0.2881, pruned_loss=0.06764, over 4278887.03 frames. ], batch size: 282, lr: 2.46e-03, grad_scale: 16.0 2023-06-28 09:48:03,755 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=2028666.0, ans=0.0 2023-06-28 09:48:14,132 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=2028666.0, ans=0.0 2023-06-28 09:48:40,912 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=10.30 vs. limit=15.0 2023-06-28 09:49:07,743 INFO [train.py:996] (3/4) Epoch 12, batch 2700, loss[loss=0.1957, simple_loss=0.2771, pruned_loss=0.05711, over 21848.00 frames. ], tot_loss[loss=0.211, simple_loss=0.2881, pruned_loss=0.06695, over 4279577.90 frames. ], batch size: 333, lr: 2.46e-03, grad_scale: 16.0 2023-06-28 09:49:25,941 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.017e+02 6.917e+02 8.915e+02 1.240e+03 3.062e+03, threshold=1.783e+03, percent-clipped=1.0 2023-06-28 09:49:36,337 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=2028906.0, ans=0.0 2023-06-28 09:49:38,136 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=2028906.0, ans=0.04949747468305833 2023-06-28 09:49:49,339 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2028966.0, ans=0.1 2023-06-28 09:50:00,733 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=2028966.0, ans=0.125 2023-06-28 09:50:11,264 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=2029026.0, ans=0.125 2023-06-28 09:50:51,156 INFO [train.py:996] (3/4) Epoch 12, batch 2750, loss[loss=0.2081, simple_loss=0.2705, pruned_loss=0.07282, over 21525.00 frames. ], tot_loss[loss=0.2112, simple_loss=0.2879, pruned_loss=0.06719, over 4271026.11 frames. ], batch size: 441, lr: 2.46e-03, grad_scale: 16.0 2023-06-28 09:50:57,736 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.22 vs. limit=12.0 2023-06-28 09:52:20,550 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.21 vs. limit=6.0 2023-06-28 09:52:43,516 INFO [train.py:996] (3/4) Epoch 12, batch 2800, loss[loss=0.2225, simple_loss=0.3232, pruned_loss=0.06089, over 21789.00 frames. 
], tot_loss[loss=0.2148, simple_loss=0.2931, pruned_loss=0.06823, over 4268788.86 frames. ], batch size: 298, lr: 2.46e-03, grad_scale: 32.0 2023-06-28 09:52:58,754 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.305e+02 8.151e+02 1.437e+03 2.226e+03 4.806e+03, threshold=2.874e+03, percent-clipped=38.0 2023-06-28 09:54:28,768 INFO [train.py:996] (3/4) Epoch 12, batch 2850, loss[loss=0.2244, simple_loss=0.3101, pruned_loss=0.06939, over 21634.00 frames. ], tot_loss[loss=0.2171, simple_loss=0.2946, pruned_loss=0.06985, over 4274110.38 frames. ], batch size: 389, lr: 2.46e-03, grad_scale: 16.0 2023-06-28 09:54:44,789 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=2029806.0, ans=0.035 2023-06-28 09:54:59,418 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=2029806.0, ans=0.0 2023-06-28 09:56:02,813 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=2029986.0, ans=0.0 2023-06-28 09:56:12,486 INFO [train.py:996] (3/4) Epoch 12, batch 2900, loss[loss=0.1997, simple_loss=0.278, pruned_loss=0.06069, over 21821.00 frames. ], tot_loss[loss=0.2162, simple_loss=0.293, pruned_loss=0.06969, over 4271490.17 frames. ], batch size: 332, lr: 2.46e-03, grad_scale: 16.0 2023-06-28 09:56:16,571 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=2030046.0, ans=0.125 2023-06-28 09:56:18,269 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=2030046.0, ans=0.5 2023-06-28 09:56:27,896 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.497e+02 8.368e+02 1.188e+03 2.037e+03 3.726e+03, threshold=2.377e+03, percent-clipped=4.0 2023-06-28 09:57:04,881 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=2030166.0, ans=0.2 2023-06-28 09:57:56,773 INFO [train.py:996] (3/4) Epoch 12, batch 2950, loss[loss=0.1942, simple_loss=0.2752, pruned_loss=0.05663, over 21707.00 frames. ], tot_loss[loss=0.2163, simple_loss=0.2935, pruned_loss=0.06951, over 4282798.66 frames. ], batch size: 112, lr: 2.46e-03, grad_scale: 16.0 2023-06-28 09:59:07,516 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.23 vs. limit=15.0 2023-06-28 09:59:25,158 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=2030586.0, ans=0.5 2023-06-28 09:59:41,607 INFO [train.py:996] (3/4) Epoch 12, batch 3000, loss[loss=0.2572, simple_loss=0.3292, pruned_loss=0.09263, over 21243.00 frames. ], tot_loss[loss=0.2199, simple_loss=0.2986, pruned_loss=0.0706, over 4280810.74 frames. ], batch size: 143, lr: 2.46e-03, grad_scale: 16.0 2023-06-28 09:59:41,608 INFO [train.py:1019] (3/4) Computing validation loss 2023-06-28 09:59:53,498 INFO [zipformer.py:1728] (3/4) name=encoder.encoders.3.encoder.layers.1.self_attn_weights, attn_weights_entropy = tensor([2.3262, 3.0088, 3.2355, 3.3788, 2.8792, 2.7922, 3.4367, 3.3824], device='cuda:3') 2023-06-28 10:00:03,548 INFO [train.py:1028] (3/4) Epoch 12, validation: loss=0.2539, simple_loss=0.3416, pruned_loss=0.08306, over 1796401.00 frames. 
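The per-batch "tot_loss[...]" entries and the periodic "validation: loss=..." entries in this log follow a fixed textual pattern, so a loss curve can be recovered from the log itself. The snippet below is a minimal parsing sketch, not part of icefall's train.py: the regex, the helper name extract_loss_curve, and the command-line usage are all illustrative assumptions based only on the line format visible in this section.

    import re
    import sys

    # Matches per-batch progress entries of the form seen above, e.g.
    # "Epoch 12, batch 3000, loss[...], tot_loss[loss=0.2199, simple_loss=0.2986, ...], batch size: 143, lr: 2.46e-03"
    # Pattern and helper are illustrative only; they are not defined anywhere in the training code.
    TRAIN_RE = re.compile(
        r"Epoch (?P<epoch>\d+), batch (?P<batch>\d+), .*?tot_loss\[loss=(?P<loss>[0-9.]+)"
    )

    def extract_loss_curve(text):
        """Return (epoch, batch, tot_loss) tuples for every training entry found in text."""
        return [
            (int(m.group("epoch")), int(m.group("batch")), float(m.group("loss")))
            for m in TRAIN_RE.finditer(text)
        ]

    if __name__ == "__main__":
        # Hypothetical usage: python parse_tot_loss.py path/to/log-train.txt
        with open(sys.argv[1], errors="replace") as f:
            for point in extract_loss_curve(f.read())[-5:]:
                print(point)

Because finditer scans the whole text rather than individual lines, the sketch works whether each log entry sits on its own line or several entries have been run together, as in this extract.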
2023-06-28 10:00:03,549 INFO [train.py:1029] (3/4) Maximum memory allocated so far is 23690MB 2023-06-28 10:00:13,065 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2030646.0, ans=0.1 2023-06-28 10:00:24,300 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.619e+02 8.195e+02 1.192e+03 1.732e+03 4.635e+03, threshold=2.384e+03, percent-clipped=12.0 2023-06-28 10:00:43,174 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2030706.0, ans=0.1 2023-06-28 10:01:26,816 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=2030886.0, ans=0.125 2023-06-28 10:01:35,328 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=2030886.0, ans=0.0 2023-06-28 10:01:42,907 INFO [train.py:996] (3/4) Epoch 12, batch 3050, loss[loss=0.1798, simple_loss=0.2639, pruned_loss=0.04779, over 21862.00 frames. ], tot_loss[loss=0.218, simple_loss=0.2983, pruned_loss=0.0688, over 4287060.98 frames. ], batch size: 282, lr: 2.46e-03, grad_scale: 16.0 2023-06-28 10:02:33,759 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=2031066.0, ans=0.125 2023-06-28 10:02:37,489 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=2031066.0, ans=0.0 2023-06-28 10:02:45,488 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=2031066.0, ans=0.125 2023-06-28 10:03:11,505 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=2031186.0, ans=0.0 2023-06-28 10:03:17,129 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.41 vs. limit=10.0 2023-06-28 10:03:37,789 INFO [train.py:996] (3/4) Epoch 12, batch 3100, loss[loss=0.2347, simple_loss=0.3, pruned_loss=0.08467, over 21714.00 frames. ], tot_loss[loss=0.2168, simple_loss=0.2976, pruned_loss=0.06802, over 4285426.30 frames. ], batch size: 441, lr: 2.46e-03, grad_scale: 16.0 2023-06-28 10:03:43,077 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=2031246.0, ans=0.125 2023-06-28 10:03:53,105 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.02 vs. 
limit=12.0 2023-06-28 10:03:57,010 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.760e+02 7.796e+02 1.121e+03 1.860e+03 4.097e+03, threshold=2.242e+03, percent-clipped=9.0 2023-06-28 10:04:01,020 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=2031306.0, ans=0.0 2023-06-28 10:04:16,507 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2031306.0, ans=0.1 2023-06-28 10:04:25,095 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=2031366.0, ans=0.04949747468305833 2023-06-28 10:04:42,285 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=2031426.0, ans=0.125 2023-06-28 10:04:47,625 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=2031426.0, ans=0.125 2023-06-28 10:05:27,749 INFO [train.py:996] (3/4) Epoch 12, batch 3150, loss[loss=0.2948, simple_loss=0.3576, pruned_loss=0.116, over 21434.00 frames. ], tot_loss[loss=0.2181, simple_loss=0.2987, pruned_loss=0.06876, over 4279741.37 frames. ], batch size: 471, lr: 2.46e-03, grad_scale: 16.0 2023-06-28 10:06:00,345 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=2031606.0, ans=0.0 2023-06-28 10:07:12,551 INFO [train.py:996] (3/4) Epoch 12, batch 3200, loss[loss=0.2157, simple_loss=0.3111, pruned_loss=0.06014, over 21695.00 frames. ], tot_loss[loss=0.217, simple_loss=0.298, pruned_loss=0.06801, over 4276421.36 frames. ], batch size: 414, lr: 2.46e-03, grad_scale: 32.0 2023-06-28 10:07:26,627 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=2031846.0, ans=0.2 2023-06-28 10:07:29,668 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=2031846.0, ans=0.0 2023-06-28 10:07:32,467 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.058e+02 7.728e+02 1.156e+03 1.759e+03 4.154e+03, threshold=2.311e+03, percent-clipped=17.0 2023-06-28 10:09:00,237 INFO [train.py:996] (3/4) Epoch 12, batch 3250, loss[loss=0.1904, simple_loss=0.264, pruned_loss=0.05839, over 21392.00 frames. ], tot_loss[loss=0.2185, simple_loss=0.2994, pruned_loss=0.06886, over 4274515.04 frames. ], batch size: 211, lr: 2.46e-03, grad_scale: 16.0 2023-06-28 10:09:19,337 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=2032206.0, ans=0.125 2023-06-28 10:09:20,807 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=2032206.0, ans=0.125 2023-06-28 10:09:41,263 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=2032266.0, ans=0.125 2023-06-28 10:09:43,877 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=10.02 vs. limit=15.0 2023-06-28 10:10:39,260 INFO [train.py:996] (3/4) Epoch 12, batch 3300, loss[loss=0.1777, simple_loss=0.2677, pruned_loss=0.0438, over 21565.00 frames. ], tot_loss[loss=0.2174, simple_loss=0.2969, pruned_loss=0.069, over 4270933.19 frames. 
], batch size: 230, lr: 2.46e-03, grad_scale: 16.0 2023-06-28 10:10:44,685 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=2032446.0, ans=0.125 2023-06-28 10:10:55,999 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.931e+02 8.073e+02 1.537e+03 2.186e+03 4.176e+03, threshold=3.073e+03, percent-clipped=21.0 2023-06-28 10:11:02,769 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn1.whiten.whitening_limit, batch_count=2032506.0, ans=22.5 2023-06-28 10:11:08,844 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=2032506.0, ans=0.125 2023-06-28 10:11:25,107 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.09 vs. limit=15.0 2023-06-28 10:12:20,090 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.48 vs. limit=6.0 2023-06-28 10:12:23,344 INFO [train.py:996] (3/4) Epoch 12, batch 3350, loss[loss=0.2409, simple_loss=0.3196, pruned_loss=0.08112, over 21437.00 frames. ], tot_loss[loss=0.2189, simple_loss=0.2987, pruned_loss=0.06953, over 4268879.42 frames. ], batch size: 131, lr: 2.46e-03, grad_scale: 16.0 2023-06-28 10:12:59,930 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2.whitening_limit, batch_count=2032806.0, ans=15.0 2023-06-28 10:13:31,033 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=2032926.0, ans=0.0 2023-06-28 10:14:05,602 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2033046.0, ans=0.1 2023-06-28 10:14:06,584 INFO [train.py:996] (3/4) Epoch 12, batch 3400, loss[loss=0.2137, simple_loss=0.2917, pruned_loss=0.06789, over 21856.00 frames. ], tot_loss[loss=0.2198, simple_loss=0.2989, pruned_loss=0.07039, over 4278323.43 frames. ], batch size: 414, lr: 2.46e-03, grad_scale: 16.0 2023-06-28 10:14:15,734 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=2033046.0, ans=0.025 2023-06-28 10:14:28,091 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.882e+02 7.652e+02 1.057e+03 1.709e+03 3.627e+03, threshold=2.113e+03, percent-clipped=2.0 2023-06-28 10:14:37,038 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer_ff3.min_abs, batch_count=2033106.0, ans=0.2 2023-06-28 10:15:17,503 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=2033226.0, ans=0.0 2023-06-28 10:15:50,725 INFO [train.py:996] (3/4) Epoch 12, batch 3450, loss[loss=0.2469, simple_loss=0.3246, pruned_loss=0.08462, over 21845.00 frames. ], tot_loss[loss=0.2163, simple_loss=0.2931, pruned_loss=0.06977, over 4283788.82 frames. ], batch size: 372, lr: 2.46e-03, grad_scale: 16.0 2023-06-28 10:16:15,914 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.64 vs. 
limit=15.0 2023-06-28 10:16:35,612 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=2033406.0, ans=0.125 2023-06-28 10:16:42,047 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=2033466.0, ans=0.025 2023-06-28 10:17:28,808 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=2033586.0, ans=0.125 2023-06-28 10:17:32,292 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2033586.0, ans=0.1 2023-06-28 10:17:35,122 INFO [train.py:996] (3/4) Epoch 12, batch 3500, loss[loss=0.382, simple_loss=0.457, pruned_loss=0.1535, over 21461.00 frames. ], tot_loss[loss=0.2224, simple_loss=0.3, pruned_loss=0.07237, over 4282948.72 frames. ], batch size: 507, lr: 2.46e-03, grad_scale: 8.0 2023-06-28 10:17:41,246 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.27 vs. limit=15.0 2023-06-28 10:18:03,136 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.129e+02 8.730e+02 1.318e+03 1.854e+03 3.895e+03, threshold=2.636e+03, percent-clipped=20.0 2023-06-28 10:18:42,543 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=2033766.0, ans=0.2 2023-06-28 10:18:58,536 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.73 vs. limit=15.0 2023-06-28 10:19:23,748 INFO [train.py:996] (3/4) Epoch 12, batch 3550, loss[loss=0.2274, simple_loss=0.3163, pruned_loss=0.0693, over 21676.00 frames. ], tot_loss[loss=0.2261, simple_loss=0.3047, pruned_loss=0.07374, over 4276703.94 frames. ], batch size: 263, lr: 2.46e-03, grad_scale: 8.0 2023-06-28 10:20:10,845 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=10.77 vs. limit=15.0 2023-06-28 10:20:17,014 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.42 vs. limit=22.5 2023-06-28 10:20:38,743 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=2034126.0, ans=0.0 2023-06-28 10:21:06,880 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2034246.0, ans=0.1 2023-06-28 10:21:12,826 INFO [train.py:996] (3/4) Epoch 12, batch 3600, loss[loss=0.196, simple_loss=0.2648, pruned_loss=0.06358, over 21197.00 frames. ], tot_loss[loss=0.2227, simple_loss=0.299, pruned_loss=0.07321, over 4276603.54 frames. 
], batch size: 608, lr: 2.46e-03, grad_scale: 16.0 2023-06-28 10:21:25,186 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=2034246.0, ans=0.0 2023-06-28 10:21:31,738 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.278e+02 7.988e+02 1.219e+03 1.896e+03 5.241e+03, threshold=2.438e+03, percent-clipped=11.0 2023-06-28 10:21:40,746 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=2034306.0, ans=0.2 2023-06-28 10:22:01,212 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=2034366.0, ans=0.125 2023-06-28 10:22:02,778 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-28 10:22:29,438 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=2034486.0, ans=0.2 2023-06-28 10:22:31,357 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2034486.0, ans=0.1 2023-06-28 10:22:51,684 INFO [train.py:996] (3/4) Epoch 12, batch 3650, loss[loss=0.1869, simple_loss=0.2731, pruned_loss=0.05036, over 21756.00 frames. ], tot_loss[loss=0.2228, simple_loss=0.2989, pruned_loss=0.07336, over 4280423.00 frames. ], batch size: 247, lr: 2.46e-03, grad_scale: 16.0 2023-06-28 10:23:08,585 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=2034546.0, ans=0.125 2023-06-28 10:23:51,801 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2034726.0, ans=0.1 2023-06-28 10:24:01,614 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=2034726.0, ans=0.0 2023-06-28 10:24:04,988 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=2034726.0, ans=0.125 2023-06-28 10:24:27,206 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.23 vs. limit=15.0 2023-06-28 10:24:33,948 INFO [train.py:996] (3/4) Epoch 12, batch 3700, loss[loss=0.2059, simple_loss=0.2881, pruned_loss=0.06179, over 21322.00 frames. ], tot_loss[loss=0.2196, simple_loss=0.2965, pruned_loss=0.07138, over 4279515.54 frames. ], batch size: 176, lr: 2.46e-03, grad_scale: 16.0 2023-06-28 10:24:57,048 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.964e+02 7.431e+02 1.073e+03 1.535e+03 4.329e+03, threshold=2.147e+03, percent-clipped=8.0 2023-06-28 10:25:24,331 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=2034966.0, ans=0.2 2023-06-28 10:25:47,995 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=2035026.0, ans=0.0 2023-06-28 10:26:08,348 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2035086.0, ans=0.1 2023-06-28 10:26:17,524 INFO [train.py:996] (3/4) Epoch 12, batch 3750, loss[loss=0.2047, simple_loss=0.2885, pruned_loss=0.06041, over 19930.00 frames. ], tot_loss[loss=0.2174, simple_loss=0.2946, pruned_loss=0.07012, over 4280472.97 frames. 
], batch size: 703, lr: 2.46e-03, grad_scale: 16.0 2023-06-28 10:26:23,464 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=2035146.0, ans=0.05 2023-06-28 10:27:06,624 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.63 vs. limit=22.5 2023-06-28 10:27:18,275 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=2035326.0, ans=0.0 2023-06-28 10:27:22,525 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=14.35 vs. limit=22.5 2023-06-28 10:27:51,020 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.40 vs. limit=22.5 2023-06-28 10:27:57,846 INFO [train.py:996] (3/4) Epoch 12, batch 3800, loss[loss=0.275, simple_loss=0.3298, pruned_loss=0.11, over 21404.00 frames. ], tot_loss[loss=0.2146, simple_loss=0.2923, pruned_loss=0.06843, over 4287693.02 frames. ], batch size: 509, lr: 2.46e-03, grad_scale: 16.0 2023-06-28 10:28:21,270 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.57 vs. limit=22.5 2023-06-28 10:28:21,911 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.866e+02 7.287e+02 1.012e+03 1.468e+03 2.920e+03, threshold=2.024e+03, percent-clipped=9.0 2023-06-28 10:28:30,604 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=2035506.0, ans=0.125 2023-06-28 10:29:38,014 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.68 vs. limit=22.5 2023-06-28 10:29:40,093 INFO [train.py:996] (3/4) Epoch 12, batch 3850, loss[loss=0.1912, simple_loss=0.255, pruned_loss=0.06371, over 21184.00 frames. ], tot_loss[loss=0.2146, simple_loss=0.2911, pruned_loss=0.06908, over 4283604.20 frames. ], batch size: 176, lr: 2.46e-03, grad_scale: 16.0 2023-06-28 10:29:50,884 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2035746.0, ans=0.1 2023-06-28 10:30:04,269 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-28 10:30:13,315 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=6.42 vs. limit=22.5 2023-06-28 10:30:21,018 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=2035866.0, ans=0.09899494936611666 2023-06-28 10:30:26,092 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2035866.0, ans=0.1 2023-06-28 10:31:23,410 INFO [train.py:996] (3/4) Epoch 12, batch 3900, loss[loss=0.192, simple_loss=0.2661, pruned_loss=0.05892, over 21815.00 frames. ], tot_loss[loss=0.2123, simple_loss=0.2872, pruned_loss=0.06867, over 4275814.80 frames. 
], batch size: 112, lr: 2.46e-03, grad_scale: 16.0 2023-06-28 10:31:39,707 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=2036106.0, ans=0.2 2023-06-28 10:31:47,273 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.137e+02 7.075e+02 9.098e+02 1.343e+03 3.131e+03, threshold=1.820e+03, percent-clipped=11.0 2023-06-28 10:33:08,679 INFO [train.py:996] (3/4) Epoch 12, batch 3950, loss[loss=0.1725, simple_loss=0.2652, pruned_loss=0.03986, over 21730.00 frames. ], tot_loss[loss=0.2118, simple_loss=0.2886, pruned_loss=0.06747, over 4273193.43 frames. ], batch size: 351, lr: 2.46e-03, grad_scale: 16.0 2023-06-28 10:33:34,322 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=2036406.0, ans=0.0 2023-06-28 10:33:38,340 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=4.13 vs. limit=15.0 2023-06-28 10:34:52,695 INFO [train.py:996] (3/4) Epoch 12, batch 4000, loss[loss=0.1607, simple_loss=0.2368, pruned_loss=0.04226, over 21643.00 frames. ], tot_loss[loss=0.2058, simple_loss=0.2819, pruned_loss=0.06489, over 4267317.51 frames. ], batch size: 282, lr: 2.46e-03, grad_scale: 32.0 2023-06-28 10:35:16,138 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.215e+02 7.767e+02 1.100e+03 1.663e+03 3.671e+03, threshold=2.200e+03, percent-clipped=20.0 2023-06-28 10:35:31,532 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.47 vs. limit=15.0 2023-06-28 10:36:35,202 INFO [train.py:996] (3/4) Epoch 12, batch 4050, loss[loss=0.1871, simple_loss=0.2791, pruned_loss=0.04752, over 21770.00 frames. ], tot_loss[loss=0.2042, simple_loss=0.2817, pruned_loss=0.06338, over 4274698.20 frames. ], batch size: 298, lr: 2.46e-03, grad_scale: 16.0 2023-06-28 10:36:52,547 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2036946.0, ans=0.1 2023-06-28 10:37:07,262 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=2037006.0, ans=0.0 2023-06-28 10:37:10,510 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=2037006.0, ans=0.0 2023-06-28 10:37:29,222 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=12.21 vs. limit=15.0 2023-06-28 10:37:46,491 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=2037126.0, ans=0.025 2023-06-28 10:38:18,406 INFO [train.py:996] (3/4) Epoch 12, batch 4100, loss[loss=0.1952, simple_loss=0.276, pruned_loss=0.05724, over 21275.00 frames. ], tot_loss[loss=0.2055, simple_loss=0.2832, pruned_loss=0.06386, over 4284344.65 frames. 
], batch size: 143, lr: 2.46e-03, grad_scale: 8.0 2023-06-28 10:38:25,561 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2037246.0, ans=0.1 2023-06-28 10:38:27,010 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=2037246.0, ans=0.125 2023-06-28 10:38:45,583 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.840e+02 7.700e+02 1.227e+03 1.924e+03 4.359e+03, threshold=2.455e+03, percent-clipped=14.0 2023-06-28 10:40:06,880 INFO [train.py:996] (3/4) Epoch 12, batch 4150, loss[loss=0.2147, simple_loss=0.2817, pruned_loss=0.07384, over 20022.00 frames. ], tot_loss[loss=0.204, simple_loss=0.2839, pruned_loss=0.06207, over 4266109.24 frames. ], batch size: 703, lr: 2.46e-03, grad_scale: 8.0 2023-06-28 10:40:40,270 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=2037666.0, ans=0.0 2023-06-28 10:41:52,375 INFO [train.py:996] (3/4) Epoch 12, batch 4200, loss[loss=0.1905, simple_loss=0.2605, pruned_loss=0.06027, over 21287.00 frames. ], tot_loss[loss=0.2057, simple_loss=0.2861, pruned_loss=0.06262, over 4259697.22 frames. ], batch size: 144, lr: 2.46e-03, grad_scale: 8.0 2023-06-28 10:41:54,846 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=2037846.0, ans=0.0 2023-06-28 10:42:14,626 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.777e+02 8.293e+02 1.484e+03 2.185e+03 3.637e+03, threshold=2.967e+03, percent-clipped=18.0 2023-06-28 10:42:15,724 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.68 vs. limit=12.0 2023-06-28 10:43:07,852 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.70 vs. limit=15.0 2023-06-28 10:43:10,270 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=2038026.0, ans=0.125 2023-06-28 10:43:10,273 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=2038026.0, ans=0.125 2023-06-28 10:43:37,190 INFO [train.py:996] (3/4) Epoch 12, batch 4250, loss[loss=0.2779, simple_loss=0.3559, pruned_loss=0.09996, over 21411.00 frames. ], tot_loss[loss=0.2098, simple_loss=0.2911, pruned_loss=0.06427, over 4257779.21 frames. ], batch size: 471, lr: 2.46e-03, grad_scale: 8.0 2023-06-28 10:43:37,905 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=2038146.0, ans=0.125 2023-06-28 10:45:07,392 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2038386.0, ans=0.1 2023-06-28 10:45:24,209 INFO [train.py:996] (3/4) Epoch 12, batch 4300, loss[loss=0.1416, simple_loss=0.1841, pruned_loss=0.04953, over 17344.00 frames. ], tot_loss[loss=0.2135, simple_loss=0.2958, pruned_loss=0.06556, over 4249563.95 frames. 
], batch size: 62, lr: 2.46e-03, grad_scale: 8.0 2023-06-28 10:45:38,052 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=2038446.0, ans=0.125 2023-06-28 10:46:00,805 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.611e+02 9.355e+02 1.305e+03 1.983e+03 5.098e+03, threshold=2.609e+03, percent-clipped=8.0 2023-06-28 10:46:08,238 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=2038506.0, ans=0.5 2023-06-28 10:46:15,685 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=10.50 vs. limit=15.0 2023-06-28 10:46:29,023 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.32 vs. limit=6.0 2023-06-28 10:46:34,036 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=9.15 vs. limit=22.5 2023-06-28 10:46:45,402 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.73 vs. limit=15.0 2023-06-28 10:46:51,764 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=2038686.0, ans=0.125 2023-06-28 10:47:12,493 INFO [train.py:996] (3/4) Epoch 12, batch 4350, loss[loss=0.1799, simple_loss=0.2449, pruned_loss=0.05745, over 21563.00 frames. ], tot_loss[loss=0.213, simple_loss=0.2952, pruned_loss=0.0654, over 4252423.15 frames. ], batch size: 247, lr: 2.46e-03, grad_scale: 8.0 2023-06-28 10:47:23,059 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=2038746.0, ans=0.125 2023-06-28 10:47:31,155 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.min_positive, batch_count=2038746.0, ans=0.025 2023-06-28 10:47:43,826 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.65 vs. limit=22.5 2023-06-28 10:48:14,173 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=2038866.0, ans=0.125 2023-06-28 10:48:31,418 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=7.58 vs. limit=15.0 2023-06-28 10:49:03,164 INFO [train.py:996] (3/4) Epoch 12, batch 4400, loss[loss=0.2103, simple_loss=0.3057, pruned_loss=0.05747, over 21574.00 frames. ], tot_loss[loss=0.2107, simple_loss=0.2914, pruned_loss=0.06498, over 4258999.03 frames. ], batch size: 263, lr: 2.46e-03, grad_scale: 16.0 2023-06-28 10:49:15,992 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.70 vs. 
limit=15.0 2023-06-28 10:49:18,659 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=2039046.0, ans=0.0 2023-06-28 10:49:35,019 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.000e+02 1.052e+03 1.456e+03 1.843e+03 4.869e+03, threshold=2.912e+03, percent-clipped=14.0 2023-06-28 10:49:40,987 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=2039106.0, ans=0.0 2023-06-28 10:49:48,179 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=2039166.0, ans=0.0 2023-06-28 10:49:51,490 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=2039166.0, ans=0.0 2023-06-28 10:50:19,571 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=2039226.0, ans=0.035 2023-06-28 10:50:19,670 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=2039226.0, ans=0.0 2023-06-28 10:50:53,925 INFO [train.py:996] (3/4) Epoch 12, batch 4450, loss[loss=0.2546, simple_loss=0.3633, pruned_loss=0.07296, over 21261.00 frames. ], tot_loss[loss=0.2153, simple_loss=0.2982, pruned_loss=0.06617, over 4260128.82 frames. ], batch size: 549, lr: 2.46e-03, grad_scale: 16.0 2023-06-28 10:51:20,809 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=2039406.0, ans=0.2 2023-06-28 10:51:57,012 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=2039526.0, ans=0.125 2023-06-28 10:52:00,396 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=2039526.0, ans=0.125 2023-06-28 10:52:38,110 INFO [train.py:996] (3/4) Epoch 12, batch 4500, loss[loss=0.2252, simple_loss=0.3143, pruned_loss=0.06807, over 21751.00 frames. ], tot_loss[loss=0.2185, simple_loss=0.3003, pruned_loss=0.06836, over 4264095.37 frames. ], batch size: 414, lr: 2.46e-03, grad_scale: 16.0 2023-06-28 10:52:58,538 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=2039706.0, ans=0.125 2023-06-28 10:53:04,864 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.459e+02 9.304e+02 1.246e+03 2.301e+03 3.917e+03, threshold=2.492e+03, percent-clipped=11.0 2023-06-28 10:53:30,110 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=2039766.0, ans=0.2 2023-06-28 10:53:50,004 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-28 10:53:55,814 INFO [scaling.py:962] (3/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=5.55 vs. limit=8.0 2023-06-28 10:54:28,126 INFO [train.py:996] (3/4) Epoch 12, batch 4550, loss[loss=0.2484, simple_loss=0.33, pruned_loss=0.08339, over 21932.00 frames. ], tot_loss[loss=0.2191, simple_loss=0.3024, pruned_loss=0.06787, over 4263163.62 frames. ], batch size: 372, lr: 2.46e-03, grad_scale: 16.0 2023-06-28 10:54:31,192 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.35 vs. 
limit=12.0 2023-06-28 10:54:38,399 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.whiten.whitening_limit, batch_count=2039946.0, ans=12.0 2023-06-28 10:55:08,208 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=2040066.0, ans=0.07 2023-06-28 10:55:11,502 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=2040066.0, ans=0.0 2023-06-28 10:56:14,155 INFO [train.py:996] (3/4) Epoch 12, batch 4600, loss[loss=0.2533, simple_loss=0.3216, pruned_loss=0.09255, over 21330.00 frames. ], tot_loss[loss=0.2205, simple_loss=0.3036, pruned_loss=0.06872, over 4266668.90 frames. ], batch size: 548, lr: 2.46e-03, grad_scale: 16.0 2023-06-28 10:56:23,432 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-28 10:56:30,667 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=2040306.0, ans=0.125 2023-06-28 10:56:36,680 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.360e+02 7.467e+02 1.139e+03 1.677e+03 2.825e+03, threshold=2.277e+03, percent-clipped=5.0 2023-06-28 10:57:03,349 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2040366.0, ans=0.1 2023-06-28 10:57:09,051 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=12.45 vs. limit=15.0 2023-06-28 10:57:55,401 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=2040486.0, ans=0.0 2023-06-28 10:57:58,182 INFO [train.py:996] (3/4) Epoch 12, batch 4650, loss[loss=0.1595, simple_loss=0.2457, pruned_loss=0.03661, over 21782.00 frames. ], tot_loss[loss=0.2175, simple_loss=0.2989, pruned_loss=0.068, over 4277328.63 frames. ], batch size: 282, lr: 2.46e-03, grad_scale: 16.0 2023-06-28 10:58:01,992 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=2040546.0, ans=0.0 2023-06-28 10:58:02,062 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=2040546.0, ans=0.125 2023-06-28 10:59:02,903 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=2040726.0, ans=0.125 2023-06-28 10:59:03,443 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=15.65 vs. limit=15.0 2023-06-28 10:59:40,226 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten.whitening_limit, batch_count=2040846.0, ans=15.0 2023-06-28 10:59:40,591 INFO [train.py:996] (3/4) Epoch 12, batch 4700, loss[loss=0.2244, simple_loss=0.3441, pruned_loss=0.05229, over 20804.00 frames. ], tot_loss[loss=0.2116, simple_loss=0.2917, pruned_loss=0.06573, over 4266867.90 frames. 
], batch size: 607, lr: 2.46e-03, grad_scale: 16.0 2023-06-28 10:59:51,724 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=2040846.0, ans=0.125 2023-06-28 11:00:07,732 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.876e+02 7.682e+02 1.181e+03 1.934e+03 4.585e+03, threshold=2.362e+03, percent-clipped=15.0 2023-06-28 11:00:16,824 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2040906.0, ans=0.1 2023-06-28 11:01:04,284 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=2041086.0, ans=0.125 2023-06-28 11:01:11,881 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=2041086.0, ans=0.015 2023-06-28 11:01:23,165 INFO [train.py:996] (3/4) Epoch 12, batch 4750, loss[loss=0.2235, simple_loss=0.2961, pruned_loss=0.07545, over 21818.00 frames. ], tot_loss[loss=0.2092, simple_loss=0.2873, pruned_loss=0.06554, over 4262569.03 frames. ], batch size: 414, lr: 2.46e-03, grad_scale: 16.0 2023-06-28 11:01:31,053 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.75 vs. limit=15.0 2023-06-28 11:01:33,673 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=2041146.0, ans=0.95 2023-06-28 11:03:02,942 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2041386.0, ans=0.1 2023-06-28 11:03:05,685 INFO [train.py:996] (3/4) Epoch 12, batch 4800, loss[loss=0.2102, simple_loss=0.29, pruned_loss=0.06513, over 21871.00 frames. ], tot_loss[loss=0.2111, simple_loss=0.2887, pruned_loss=0.06677, over 4272314.30 frames. ], batch size: 351, lr: 2.46e-03, grad_scale: 32.0 2023-06-28 11:03:32,402 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.216e+02 8.084e+02 1.278e+03 1.855e+03 4.015e+03, threshold=2.556e+03, percent-clipped=12.0 2023-06-28 11:03:50,874 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=2041566.0, ans=0.0 2023-06-28 11:03:54,216 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=2041566.0, ans=0.0 2023-06-28 11:04:26,744 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=2041686.0, ans=0.125 2023-06-28 11:04:47,269 INFO [train.py:996] (3/4) Epoch 12, batch 4850, loss[loss=0.2213, simple_loss=0.3436, pruned_loss=0.04949, over 20799.00 frames. ], tot_loss[loss=0.2098, simple_loss=0.288, pruned_loss=0.06578, over 4271238.81 frames. ], batch size: 608, lr: 2.46e-03, grad_scale: 32.0 2023-06-28 11:05:15,950 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=2041806.0, ans=0.125 2023-06-28 11:05:35,953 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=2041866.0, ans=0.125 2023-06-28 11:06:30,302 INFO [train.py:996] (3/4) Epoch 12, batch 4900, loss[loss=0.214, simple_loss=0.3148, pruned_loss=0.05657, over 19876.00 frames. ], tot_loss[loss=0.2099, simple_loss=0.2892, pruned_loss=0.06528, over 4271494.66 frames. 
], batch size: 703, lr: 2.46e-03, grad_scale: 16.0 2023-06-28 11:06:58,459 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.185e+02 7.394e+02 1.193e+03 1.925e+03 4.019e+03, threshold=2.386e+03, percent-clipped=10.0 2023-06-28 11:07:12,490 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=2042166.0, ans=0.0 2023-06-28 11:07:34,564 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=2042226.0, ans=0.025 2023-06-28 11:08:14,025 INFO [train.py:996] (3/4) Epoch 12, batch 4950, loss[loss=0.1664, simple_loss=0.2429, pruned_loss=0.04495, over 21792.00 frames. ], tot_loss[loss=0.2106, simple_loss=0.2921, pruned_loss=0.06457, over 4273595.06 frames. ], batch size: 124, lr: 2.45e-03, grad_scale: 16.0 2023-06-28 11:08:30,765 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=2042346.0, ans=0.0 2023-06-28 11:08:58,571 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=2042466.0, ans=0.125 2023-06-28 11:09:37,700 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=2042586.0, ans=0.125 2023-06-28 11:09:54,799 INFO [train.py:996] (3/4) Epoch 12, batch 5000, loss[loss=0.2357, simple_loss=0.3545, pruned_loss=0.05839, over 20751.00 frames. ], tot_loss[loss=0.2084, simple_loss=0.2923, pruned_loss=0.06231, over 4275131.43 frames. ], batch size: 607, lr: 2.45e-03, grad_scale: 16.0 2023-06-28 11:10:18,892 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=2042706.0, ans=0.0 2023-06-28 11:10:22,968 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.667e+02 7.098e+02 1.009e+03 1.573e+03 3.184e+03, threshold=2.017e+03, percent-clipped=11.0 2023-06-28 11:10:33,068 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=2042706.0, ans=0.0 2023-06-28 11:10:49,297 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2042766.0, ans=0.1 2023-06-28 11:11:15,336 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.57 vs. limit=15.0 2023-06-28 11:11:26,653 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.37 vs. limit=15.0 2023-06-28 11:11:35,585 INFO [train.py:996] (3/4) Epoch 12, batch 5050, loss[loss=0.2708, simple_loss=0.3155, pruned_loss=0.1131, over 21783.00 frames. ], tot_loss[loss=0.2108, simple_loss=0.293, pruned_loss=0.06429, over 4278138.70 frames. ], batch size: 508, lr: 2.45e-03, grad_scale: 16.0 2023-06-28 11:11:53,941 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=2042946.0, ans=0.0 2023-06-28 11:11:56,372 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.08 vs. 
limit=10.0 2023-06-28 11:12:33,910 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2043126.0, ans=0.1 2023-06-28 11:13:16,927 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=2043246.0, ans=0.0 2023-06-28 11:13:17,753 INFO [train.py:996] (3/4) Epoch 12, batch 5100, loss[loss=0.198, simple_loss=0.2766, pruned_loss=0.05972, over 21373.00 frames. ], tot_loss[loss=0.2094, simple_loss=0.29, pruned_loss=0.06437, over 4287394.94 frames. ], batch size: 548, lr: 2.45e-03, grad_scale: 16.0 2023-06-28 11:13:28,006 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=2043246.0, ans=10.0 2023-06-28 11:13:28,046 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=2043246.0, ans=0.2 2023-06-28 11:13:45,412 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.047e+02 7.915e+02 1.019e+03 1.431e+03 3.420e+03, threshold=2.039e+03, percent-clipped=6.0 2023-06-28 11:13:45,965 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=2043306.0, ans=0.125 2023-06-28 11:14:05,797 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=2043366.0, ans=0.125 2023-06-28 11:14:30,620 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=2043426.0, ans=0.0 2023-06-28 11:14:39,689 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.06 vs. limit=22.5 2023-06-28 11:15:00,446 INFO [train.py:996] (3/4) Epoch 12, batch 5150, loss[loss=0.2064, simple_loss=0.2683, pruned_loss=0.07222, over 20061.00 frames. ], tot_loss[loss=0.2087, simple_loss=0.2879, pruned_loss=0.06472, over 4290047.87 frames. ], batch size: 703, lr: 2.45e-03, grad_scale: 16.0 2023-06-28 11:15:01,515 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=9.73 vs. limit=15.0 2023-06-28 11:15:21,546 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=2043606.0, ans=0.2 2023-06-28 11:15:23,356 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=2043606.0, ans=0.04949747468305833 2023-06-28 11:16:07,842 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2043726.0, ans=0.1 2023-06-28 11:16:34,872 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=2043786.0, ans=0.125 2023-06-28 11:16:44,541 INFO [train.py:996] (3/4) Epoch 12, batch 5200, loss[loss=0.2501, simple_loss=0.353, pruned_loss=0.07359, over 21660.00 frames. ], tot_loss[loss=0.2112, simple_loss=0.2902, pruned_loss=0.06604, over 4292110.31 frames. 
], batch size: 389, lr: 2.45e-03, grad_scale: 32.0 2023-06-28 11:17:18,855 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.436e+02 7.444e+02 1.331e+03 2.729e+03 6.291e+03, threshold=2.663e+03, percent-clipped=30.0 2023-06-28 11:18:26,299 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=10.91 vs. limit=15.0 2023-06-28 11:18:26,589 INFO [train.py:996] (3/4) Epoch 12, batch 5250, loss[loss=0.2067, simple_loss=0.2984, pruned_loss=0.05751, over 21786.00 frames. ], tot_loss[loss=0.2124, simple_loss=0.2947, pruned_loss=0.06504, over 4292508.04 frames. ], batch size: 282, lr: 2.45e-03, grad_scale: 16.0 2023-06-28 11:19:00,900 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2044206.0, ans=0.1 2023-06-28 11:19:50,578 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=2044386.0, ans=0.0 2023-06-28 11:20:00,283 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=2044386.0, ans=0.2 2023-06-28 11:20:04,330 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.10 vs. limit=15.0 2023-06-28 11:20:08,101 INFO [train.py:996] (3/4) Epoch 12, batch 5300, loss[loss=0.2284, simple_loss=0.2931, pruned_loss=0.08186, over 21831.00 frames. ], tot_loss[loss=0.2126, simple_loss=0.294, pruned_loss=0.06567, over 4285775.69 frames. ], batch size: 441, lr: 2.45e-03, grad_scale: 16.0 2023-06-28 11:20:42,497 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.313e+02 7.509e+02 1.039e+03 1.571e+03 3.451e+03, threshold=2.078e+03, percent-clipped=7.0 2023-06-28 11:20:51,226 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=2044566.0, ans=0.125 2023-06-28 11:21:48,594 INFO [train.py:996] (3/4) Epoch 12, batch 5350, loss[loss=0.1826, simple_loss=0.2529, pruned_loss=0.05613, over 21679.00 frames. ], tot_loss[loss=0.2126, simple_loss=0.2919, pruned_loss=0.06669, over 4290883.68 frames. ], batch size: 230, lr: 2.45e-03, grad_scale: 16.0 2023-06-28 11:21:56,839 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=2044746.0, ans=0.125 2023-06-28 11:22:28,033 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=2044806.0, ans=10.0 2023-06-28 11:22:40,129 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-28 11:23:03,123 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=2044926.0, ans=0.2 2023-06-28 11:23:04,660 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=2044926.0, ans=0.125 2023-06-28 11:23:18,826 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=10.77 vs. limit=15.0 2023-06-28 11:23:35,436 INFO [train.py:996] (3/4) Epoch 12, batch 5400, loss[loss=0.1813, simple_loss=0.2634, pruned_loss=0.04964, over 21843.00 frames. 
], tot_loss[loss=0.2109, simple_loss=0.2889, pruned_loss=0.06647, over 4285102.86 frames. ], batch size: 316, lr: 2.45e-03, grad_scale: 16.0 2023-06-28 11:23:41,608 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=2045046.0, ans=0.125 2023-06-28 11:23:56,570 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=2045106.0, ans=0.125 2023-06-28 11:24:05,988 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.403e+02 8.319e+02 1.196e+03 1.782e+03 3.222e+03, threshold=2.392e+03, percent-clipped=18.0 2023-06-28 11:25:00,996 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.91 vs. limit=15.0 2023-06-28 11:25:08,598 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=2045286.0, ans=0.125 2023-06-28 11:25:19,468 INFO [train.py:996] (3/4) Epoch 12, batch 5450, loss[loss=0.2127, simple_loss=0.3078, pruned_loss=0.05884, over 21850.00 frames. ], tot_loss[loss=0.2098, simple_loss=0.2895, pruned_loss=0.06508, over 4285845.48 frames. ], batch size: 282, lr: 2.45e-03, grad_scale: 16.0 2023-06-28 11:25:33,236 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=2045346.0, ans=0.125 2023-06-28 11:25:43,732 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=2045406.0, ans=0.95 2023-06-28 11:25:55,841 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=2045406.0, ans=0.125 2023-06-28 11:26:10,842 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=2045466.0, ans=0.0 2023-06-28 11:26:33,094 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2045526.0, ans=0.1 2023-06-28 11:26:42,902 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2045526.0, ans=0.1 2023-06-28 11:26:56,801 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=7.84 vs. limit=15.0 2023-06-28 11:27:08,762 INFO [train.py:996] (3/4) Epoch 12, batch 5500, loss[loss=0.2028, simple_loss=0.3007, pruned_loss=0.05244, over 21347.00 frames. ], tot_loss[loss=0.21, simple_loss=0.2945, pruned_loss=0.06273, over 4274419.27 frames. ], batch size: 176, lr: 2.45e-03, grad_scale: 16.0 2023-06-28 11:27:14,861 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=2045646.0, ans=0.025 2023-06-28 11:27:18,508 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=2045646.0, ans=0.2 2023-06-28 11:27:19,066 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.73 vs. 
limit=12.0 2023-06-28 11:27:44,011 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.944e+02 8.580e+02 1.207e+03 1.863e+03 4.637e+03, threshold=2.413e+03, percent-clipped=15.0 2023-06-28 11:28:04,878 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=2045766.0, ans=0.125 2023-06-28 11:28:57,715 INFO [train.py:996] (3/4) Epoch 12, batch 5550, loss[loss=0.17, simple_loss=0.271, pruned_loss=0.03454, over 21684.00 frames. ], tot_loss[loss=0.2071, simple_loss=0.2943, pruned_loss=0.05988, over 4272133.01 frames. ], batch size: 298, lr: 2.45e-03, grad_scale: 16.0 2023-06-28 11:29:20,141 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=2046006.0, ans=0.2 2023-06-28 11:29:33,948 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2046006.0, ans=0.1 2023-06-28 11:29:49,100 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=12.92 vs. limit=22.5 2023-06-28 11:29:50,258 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=2046066.0, ans=0.07 2023-06-28 11:30:22,864 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.39 vs. limit=6.0 2023-06-28 11:30:24,671 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.89 vs. limit=22.5 2023-06-28 11:30:24,779 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=3.67 vs. limit=15.0 2023-06-28 11:30:46,193 INFO [train.py:996] (3/4) Epoch 12, batch 5600, loss[loss=0.2362, simple_loss=0.3296, pruned_loss=0.07135, over 21773.00 frames. ], tot_loss[loss=0.2053, simple_loss=0.2937, pruned_loss=0.05843, over 4274073.02 frames. ], batch size: 282, lr: 2.45e-03, grad_scale: 32.0 2023-06-28 11:31:13,170 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.841e+02 9.110e+02 1.414e+03 2.313e+03 5.859e+03, threshold=2.829e+03, percent-clipped=23.0 2023-06-28 11:31:41,005 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=13.95 vs. limit=15.0 2023-06-28 11:32:11,258 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=2046486.0, ans=0.125 2023-06-28 11:32:27,080 INFO [train.py:996] (3/4) Epoch 12, batch 5650, loss[loss=0.2367, simple_loss=0.3087, pruned_loss=0.08239, over 21858.00 frames. ], tot_loss[loss=0.2083, simple_loss=0.2958, pruned_loss=0.06036, over 4283352.75 frames. 
], batch size: 414, lr: 2.45e-03, grad_scale: 16.0 2023-06-28 11:32:39,443 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=2046546.0, ans=0.07 2023-06-28 11:33:05,699 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-28 11:33:23,475 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=2046666.0, ans=0.125 2023-06-28 11:33:25,214 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-28 11:34:06,875 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2046786.0, ans=0.1 2023-06-28 11:34:09,891 INFO [train.py:996] (3/4) Epoch 12, batch 5700, loss[loss=0.1973, simple_loss=0.271, pruned_loss=0.06181, over 21214.00 frames. ], tot_loss[loss=0.209, simple_loss=0.2945, pruned_loss=0.06171, over 4283102.80 frames. ], batch size: 159, lr: 2.45e-03, grad_scale: 16.0 2023-06-28 11:34:41,275 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2046906.0, ans=0.1 2023-06-28 11:34:42,200 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.668e+02 8.634e+02 1.270e+03 1.811e+03 3.578e+03, threshold=2.540e+03, percent-clipped=6.0 2023-06-28 11:35:02,338 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.83 vs. limit=12.0 2023-06-28 11:35:29,565 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=2047026.0, ans=0.0 2023-06-28 11:35:35,419 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.09 vs. limit=22.5 2023-06-28 11:35:35,427 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.97 vs. limit=6.0 2023-06-28 11:35:36,805 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=2047086.0, ans=0.125 2023-06-28 11:35:54,497 INFO [train.py:996] (3/4) Epoch 12, batch 5750, loss[loss=0.1718, simple_loss=0.2721, pruned_loss=0.03581, over 21647.00 frames. ], tot_loss[loss=0.2061, simple_loss=0.2939, pruned_loss=0.0592, over 4277734.45 frames. ], batch size: 263, lr: 2.45e-03, grad_scale: 16.0 2023-06-28 11:36:35,937 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.93 vs. limit=15.0 2023-06-28 11:36:50,486 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=2047266.0, ans=0.125 2023-06-28 11:37:43,055 INFO [train.py:996] (3/4) Epoch 12, batch 5800, loss[loss=0.2367, simple_loss=0.3374, pruned_loss=0.06802, over 21661.00 frames. ], tot_loss[loss=0.2037, simple_loss=0.2921, pruned_loss=0.05768, over 4276793.92 frames. ], batch size: 441, lr: 2.45e-03, grad_scale: 16.0 2023-06-28 11:37:57,316 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.50 vs. 
limit=12.0 2023-06-28 11:37:58,948 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=12.00 vs. limit=22.5 2023-06-28 11:38:14,538 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.621e+02 6.881e+02 1.222e+03 1.758e+03 3.677e+03, threshold=2.444e+03, percent-clipped=11.0 2023-06-28 11:38:41,458 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=2047566.0, ans=0.0 2023-06-28 11:38:45,129 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=2047566.0, ans=0.125 2023-06-28 11:39:05,875 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=2047686.0, ans=0.2 2023-06-28 11:39:19,415 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer_ff2.min_abs, batch_count=2047686.0, ans=0.1 2023-06-28 11:39:19,507 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=2047686.0, ans=0.2 2023-06-28 11:39:31,898 INFO [train.py:996] (3/4) Epoch 12, batch 5850, loss[loss=0.1615, simple_loss=0.2665, pruned_loss=0.02821, over 21762.00 frames. ], tot_loss[loss=0.1995, simple_loss=0.2893, pruned_loss=0.05483, over 4276523.93 frames. ], batch size: 282, lr: 2.45e-03, grad_scale: 16.0 2023-06-28 11:40:21,791 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=2047866.0, ans=0.125 2023-06-28 11:40:29,034 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=14.81 vs. limit=15.0 2023-06-28 11:40:30,254 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=2047926.0, ans=0.2 2023-06-28 11:40:34,964 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=2047926.0, ans=0.0 2023-06-28 11:40:56,922 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.14 vs. limit=22.5 2023-06-28 11:41:08,100 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=2048046.0, ans=0.0 2023-06-28 11:41:08,978 INFO [train.py:996] (3/4) Epoch 12, batch 5900, loss[loss=0.1577, simple_loss=0.2397, pruned_loss=0.03786, over 21411.00 frames. ], tot_loss[loss=0.1913, simple_loss=0.2816, pruned_loss=0.05055, over 4273525.60 frames. 
], batch size: 194, lr: 2.45e-03, grad_scale: 16.0 2023-06-28 11:41:27,288 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=2048046.0, ans=0.125 2023-06-28 11:41:44,137 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.278e+02 9.930e+02 1.759e+03 2.367e+03 3.954e+03, threshold=3.519e+03, percent-clipped=21.0 2023-06-28 11:41:54,334 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2048166.0, ans=0.1 2023-06-28 11:42:38,751 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=2048286.0, ans=0.5 2023-06-28 11:42:54,193 INFO [train.py:996] (3/4) Epoch 12, batch 5950, loss[loss=0.2001, simple_loss=0.2643, pruned_loss=0.06792, over 21282.00 frames. ], tot_loss[loss=0.1941, simple_loss=0.2818, pruned_loss=0.05325, over 4282541.52 frames. ], batch size: 176, lr: 2.45e-03, grad_scale: 16.0 2023-06-28 11:43:53,562 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.22 vs. limit=15.0 2023-06-28 11:44:36,726 INFO [train.py:996] (3/4) Epoch 12, batch 6000, loss[loss=0.1827, simple_loss=0.24, pruned_loss=0.06274, over 21221.00 frames. ], tot_loss[loss=0.1952, simple_loss=0.279, pruned_loss=0.05568, over 4280733.73 frames. ], batch size: 551, lr: 2.45e-03, grad_scale: 32.0 2023-06-28 11:44:36,727 INFO [train.py:1019] (3/4) Computing validation loss 2023-06-28 11:44:57,247 INFO [train.py:1028] (3/4) Epoch 12, validation: loss=0.2597, simple_loss=0.3509, pruned_loss=0.08424, over 1796401.00 frames. 2023-06-28 11:44:57,248 INFO [train.py:1029] (3/4) Maximum memory allocated so far is 23690MB 2023-06-28 11:45:10,070 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.06 vs. limit=15.0 2023-06-28 11:45:11,636 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=8.95 vs. limit=15.0 2023-06-28 11:45:21,117 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=2048706.0, ans=0.125 2023-06-28 11:45:28,552 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.837e+02 9.369e+02 1.291e+03 2.028e+03 3.757e+03, threshold=2.582e+03, percent-clipped=1.0 2023-06-28 11:45:29,315 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2048706.0, ans=0.1 2023-06-28 11:45:37,886 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=2048766.0, ans=0.125 2023-06-28 11:46:40,047 INFO [train.py:996] (3/4) Epoch 12, batch 6050, loss[loss=0.2198, simple_loss=0.2709, pruned_loss=0.08433, over 21406.00 frames. ], tot_loss[loss=0.1938, simple_loss=0.2746, pruned_loss=0.05655, over 4277824.63 frames. 
], batch size: 476, lr: 2.45e-03, grad_scale: 16.0 2023-06-28 11:48:12,338 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=2049186.0, ans=0.025 2023-06-28 11:48:15,727 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=2049186.0, ans=0.125 2023-06-28 11:48:28,635 INFO [train.py:996] (3/4) Epoch 12, batch 6100, loss[loss=0.2221, simple_loss=0.2981, pruned_loss=0.07308, over 21786.00 frames. ], tot_loss[loss=0.1934, simple_loss=0.2744, pruned_loss=0.0562, over 4283477.04 frames. ], batch size: 441, lr: 2.45e-03, grad_scale: 16.0 2023-06-28 11:48:44,888 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.53 vs. limit=6.0 2023-06-28 11:48:57,065 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.586e+02 8.433e+02 1.328e+03 2.179e+03 5.742e+03, threshold=2.657e+03, percent-clipped=17.0 2023-06-28 11:48:59,458 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=2049306.0, ans=0.125 2023-06-28 11:49:56,691 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.68 vs. limit=12.0 2023-06-28 11:50:13,343 INFO [train.py:996] (3/4) Epoch 12, batch 6150, loss[loss=0.1924, simple_loss=0.3019, pruned_loss=0.04143, over 20852.00 frames. ], tot_loss[loss=0.1957, simple_loss=0.2753, pruned_loss=0.05809, over 4277085.85 frames. ], batch size: 608, lr: 2.45e-03, grad_scale: 16.0 2023-06-28 11:50:20,389 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=2049546.0, ans=0.0 2023-06-28 11:50:39,523 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.97 vs. limit=15.0 2023-06-28 11:51:14,918 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=2049726.0, ans=0.0 2023-06-28 11:51:25,135 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=2049726.0, ans=0.2 2023-06-28 11:51:55,176 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=2049846.0, ans=0.0 2023-06-28 11:51:56,277 INFO [train.py:996] (3/4) Epoch 12, batch 6200, loss[loss=0.1815, simple_loss=0.2579, pruned_loss=0.05254, over 21520.00 frames. ], tot_loss[loss=0.1982, simple_loss=0.2787, pruned_loss=0.05886, over 4280938.96 frames. ], batch size: 212, lr: 2.45e-03, grad_scale: 16.0 2023-06-28 11:52:32,345 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.502e+02 7.829e+02 1.153e+03 1.728e+03 4.252e+03, threshold=2.307e+03, percent-clipped=8.0 2023-06-28 11:52:33,524 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=2.55 vs. limit=6.0 2023-06-28 11:52:41,417 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=2049966.0, ans=0.2 2023-06-28 11:53:41,396 INFO [train.py:996] (3/4) Epoch 12, batch 6250, loss[loss=0.1794, simple_loss=0.2818, pruned_loss=0.03852, over 21717.00 frames. 
], tot_loss[loss=0.2002, simple_loss=0.2833, pruned_loss=0.05855, over 4280329.48 frames. ], batch size: 247, lr: 2.45e-03, grad_scale: 8.0 2023-06-28 11:54:51,769 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=2050326.0, ans=0.2 2023-06-28 11:55:19,582 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2050386.0, ans=0.1 2023-06-28 11:55:23,839 INFO [train.py:996] (3/4) Epoch 12, batch 6300, loss[loss=0.1951, simple_loss=0.2709, pruned_loss=0.05962, over 21867.00 frames. ], tot_loss[loss=0.2012, simple_loss=0.2867, pruned_loss=0.05787, over 4282018.69 frames. ], batch size: 298, lr: 2.45e-03, grad_scale: 8.0 2023-06-28 11:55:42,774 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=2050446.0, ans=0.0 2023-06-28 11:55:56,248 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2050506.0, ans=0.1 2023-06-28 11:56:03,345 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.650e+02 7.162e+02 1.070e+03 1.625e+03 2.845e+03, threshold=2.140e+03, percent-clipped=5.0 2023-06-28 11:57:05,259 INFO [train.py:996] (3/4) Epoch 12, batch 6350, loss[loss=0.2511, simple_loss=0.3324, pruned_loss=0.08494, over 21407.00 frames. ], tot_loss[loss=0.2057, simple_loss=0.2891, pruned_loss=0.06118, over 4285208.85 frames. ], batch size: 131, lr: 2.45e-03, grad_scale: 8.0 2023-06-28 11:58:25,827 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=2050926.0, ans=0.0 2023-06-28 11:58:49,493 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=2050986.0, ans=0.125 2023-06-28 11:58:54,044 INFO [train.py:996] (3/4) Epoch 12, batch 6400, loss[loss=0.2438, simple_loss=0.3199, pruned_loss=0.08384, over 21530.00 frames. ], tot_loss[loss=0.2135, simple_loss=0.2956, pruned_loss=0.06566, over 4286208.48 frames. ], batch size: 194, lr: 2.45e-03, grad_scale: 16.0 2023-06-28 11:59:29,778 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.784e+02 8.222e+02 1.150e+03 1.542e+03 3.199e+03, threshold=2.299e+03, percent-clipped=10.0 2023-06-28 11:59:44,079 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.94 vs. limit=22.5 2023-06-28 12:00:01,548 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=2051226.0, ans=0.125 2023-06-28 12:00:06,631 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=2051226.0, ans=0.0 2023-06-28 12:00:35,714 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=2051346.0, ans=0.2 2023-06-28 12:00:36,723 INFO [train.py:996] (3/4) Epoch 12, batch 6450, loss[loss=0.2137, simple_loss=0.3109, pruned_loss=0.05829, over 21616.00 frames. ], tot_loss[loss=0.2155, simple_loss=0.2987, pruned_loss=0.06613, over 4290337.31 frames. 
], batch size: 414, lr: 2.45e-03, grad_scale: 16.0 2023-06-28 12:00:45,768 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=2051346.0, ans=0.5 2023-06-28 12:01:00,060 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=2051406.0, ans=0.07 2023-06-28 12:01:31,404 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.09 vs. limit=15.0 2023-06-28 12:02:05,957 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=2051586.0, ans=0.125 2023-06-28 12:02:09,443 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=2051586.0, ans=0.2 2023-06-28 12:02:16,159 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=2051586.0, ans=0.04949747468305833 2023-06-28 12:02:20,325 INFO [train.py:996] (3/4) Epoch 12, batch 6500, loss[loss=0.1668, simple_loss=0.2357, pruned_loss=0.04897, over 21545.00 frames. ], tot_loss[loss=0.2088, simple_loss=0.2893, pruned_loss=0.06412, over 4284227.57 frames. ], batch size: 231, lr: 2.45e-03, grad_scale: 16.0 2023-06-28 12:02:59,802 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.006e+02 7.341e+02 1.379e+03 1.907e+03 4.704e+03, threshold=2.757e+03, percent-clipped=17.0 2023-06-28 12:03:21,351 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.42 vs. limit=6.0 2023-06-28 12:03:24,653 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.07 vs. limit=15.0 2023-06-28 12:03:45,927 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=2051886.0, ans=0.125 2023-06-28 12:04:03,599 INFO [train.py:996] (3/4) Epoch 12, batch 6550, loss[loss=0.2164, simple_loss=0.2904, pruned_loss=0.07118, over 21787.00 frames. ], tot_loss[loss=0.2067, simple_loss=0.2875, pruned_loss=0.06297, over 4281728.01 frames. ], batch size: 282, lr: 2.45e-03, grad_scale: 16.0 2023-06-28 12:04:06,407 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.67 vs. limit=15.0 2023-06-28 12:05:24,185 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=4.58 vs. limit=15.0 2023-06-28 12:05:36,939 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2052186.0, ans=0.1 2023-06-28 12:05:44,396 INFO [train.py:996] (3/4) Epoch 12, batch 6600, loss[loss=0.1972, simple_loss=0.2651, pruned_loss=0.06467, over 21645.00 frames. ], tot_loss[loss=0.2042, simple_loss=0.2824, pruned_loss=0.06297, over 4261066.57 frames. 
], batch size: 416, lr: 2.45e-03, grad_scale: 16.0 2023-06-28 12:06:28,657 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.752e+02 7.717e+02 1.174e+03 1.589e+03 2.955e+03, threshold=2.349e+03, percent-clipped=1.0 2023-06-28 12:06:47,793 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=2052426.0, ans=0.125 2023-06-28 12:07:05,752 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2052426.0, ans=0.1 2023-06-28 12:07:32,096 INFO [train.py:996] (3/4) Epoch 12, batch 6650, loss[loss=0.1845, simple_loss=0.2543, pruned_loss=0.05731, over 21264.00 frames. ], tot_loss[loss=0.1998, simple_loss=0.2775, pruned_loss=0.06109, over 4266271.62 frames. ], batch size: 551, lr: 2.45e-03, grad_scale: 16.0 2023-06-28 12:07:34,133 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=2052546.0, ans=0.0 2023-06-28 12:08:17,397 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=13.23 vs. limit=22.5 2023-06-28 12:08:38,021 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=2052726.0, ans=0.04949747468305833 2023-06-28 12:08:47,381 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=2052726.0, ans=0.0 2023-06-28 12:08:53,175 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.09 vs. limit=15.0 2023-06-28 12:09:03,624 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2052786.0, ans=0.1 2023-06-28 12:09:12,080 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=2052846.0, ans=0.2 2023-06-28 12:09:13,031 INFO [train.py:996] (3/4) Epoch 12, batch 6700, loss[loss=0.1906, simple_loss=0.2612, pruned_loss=0.05996, over 21706.00 frames. ], tot_loss[loss=0.1976, simple_loss=0.2735, pruned_loss=0.06086, over 4254180.15 frames. ], batch size: 333, lr: 2.45e-03, grad_scale: 16.0 2023-06-28 12:09:52,367 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.639e+02 7.163e+02 1.028e+03 1.473e+03 3.561e+03, threshold=2.056e+03, percent-clipped=9.0 2023-06-28 12:10:17,814 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.66 vs. limit=12.0 2023-06-28 12:10:47,013 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys.whitening_limit, batch_count=2053086.0, ans=6.0 2023-06-28 12:10:49,638 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=2053086.0, ans=0.0 2023-06-28 12:10:53,902 INFO [train.py:996] (3/4) Epoch 12, batch 6750, loss[loss=0.1701, simple_loss=0.2432, pruned_loss=0.04846, over 21449.00 frames. ], tot_loss[loss=0.1976, simple_loss=0.273, pruned_loss=0.06114, over 4252791.88 frames. ], batch size: 212, lr: 2.45e-03, grad_scale: 16.0 2023-06-28 12:11:11,050 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.38 vs. 
limit=22.5 2023-06-28 12:11:19,149 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.33 vs. limit=15.0 2023-06-28 12:11:31,515 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-28 12:12:20,743 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=15.22 vs. limit=22.5 2023-06-28 12:12:33,673 INFO [train.py:996] (3/4) Epoch 12, batch 6800, loss[loss=0.188, simple_loss=0.2642, pruned_loss=0.05589, over 21094.00 frames. ], tot_loss[loss=0.2004, simple_loss=0.2748, pruned_loss=0.06303, over 4257446.63 frames. ], batch size: 607, lr: 2.45e-03, grad_scale: 32.0 2023-06-28 12:12:34,242 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=2053446.0, ans=0.0 2023-06-28 12:13:13,851 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.991e+02 6.929e+02 1.207e+03 2.029e+03 5.012e+03, threshold=2.414e+03, percent-clipped=24.0 2023-06-28 12:13:53,988 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=2053686.0, ans=0.125 2023-06-28 12:14:00,537 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=2053686.0, ans=0.0 2023-06-28 12:14:14,466 INFO [train.py:996] (3/4) Epoch 12, batch 6850, loss[loss=0.1964, simple_loss=0.2633, pruned_loss=0.06475, over 21822.00 frames. ], tot_loss[loss=0.1999, simple_loss=0.2723, pruned_loss=0.06375, over 4265397.13 frames. ], batch size: 333, lr: 2.45e-03, grad_scale: 16.0 2023-06-28 12:14:41,806 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.19 vs. limit=6.0 2023-06-28 12:14:57,599 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=2053866.0, ans=0.125 2023-06-28 12:15:21,002 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=2053926.0, ans=0.125 2023-06-28 12:15:55,388 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=2053986.0, ans=0.2 2023-06-28 12:15:58,203 INFO [train.py:996] (3/4) Epoch 12, batch 6900, loss[loss=0.1723, simple_loss=0.2583, pruned_loss=0.04315, over 21348.00 frames. ], tot_loss[loss=0.201, simple_loss=0.2728, pruned_loss=0.06457, over 4274725.99 frames. 
], batch size: 176, lr: 2.45e-03, grad_scale: 16.0 2023-06-28 12:16:30,664 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=2054106.0, ans=0.125 2023-06-28 12:16:39,821 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.832e+02 6.265e+02 8.270e+02 1.384e+03 3.220e+03, threshold=1.654e+03, percent-clipped=7.0 2023-06-28 12:16:55,683 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=2054166.0, ans=0.125 2023-06-28 12:17:23,863 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=2054286.0, ans=0.0 2023-06-28 12:17:45,886 INFO [train.py:996] (3/4) Epoch 12, batch 6950, loss[loss=0.1681, simple_loss=0.238, pruned_loss=0.04913, over 21799.00 frames. ], tot_loss[loss=0.1987, simple_loss=0.2743, pruned_loss=0.06158, over 4274840.10 frames. ], batch size: 102, lr: 2.45e-03, grad_scale: 16.0 2023-06-28 12:17:52,033 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.55 vs. limit=15.0 2023-06-28 12:18:41,529 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=2054466.0, ans=0.0 2023-06-28 12:18:44,923 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=2054526.0, ans=0.0 2023-06-28 12:19:23,968 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=2054586.0, ans=0.125 2023-06-28 12:19:28,442 INFO [train.py:996] (3/4) Epoch 12, batch 7000, loss[loss=0.2015, simple_loss=0.2719, pruned_loss=0.06557, over 21889.00 frames. ], tot_loss[loss=0.2034, simple_loss=0.2793, pruned_loss=0.0637, over 4274138.19 frames. ], batch size: 107, lr: 2.45e-03, grad_scale: 16.0 2023-06-28 12:19:43,875 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=2054646.0, ans=0.0 2023-06-28 12:19:58,857 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=2054706.0, ans=0.0 2023-06-28 12:20:05,474 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.352e+02 8.399e+02 1.085e+03 1.441e+03 2.628e+03, threshold=2.170e+03, percent-clipped=15.0 2023-06-28 12:20:35,820 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=12.20 vs. limit=15.0 2023-06-28 12:20:42,437 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.31 vs. limit=12.0 2023-06-28 12:21:16,084 INFO [train.py:996] (3/4) Epoch 12, batch 7050, loss[loss=0.1839, simple_loss=0.2771, pruned_loss=0.04538, over 21744.00 frames. ], tot_loss[loss=0.2002, simple_loss=0.2764, pruned_loss=0.06196, over 4266871.44 frames. ], batch size: 332, lr: 2.45e-03, grad_scale: 16.0 2023-06-28 12:21:19,043 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.85 vs. 
limit=15.0 2023-06-28 12:22:24,982 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=2055126.0, ans=0.2 2023-06-28 12:22:26,492 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=2055126.0, ans=0.0 2023-06-28 12:22:34,616 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=2055126.0, ans=0.2 2023-06-28 12:22:54,016 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=2055186.0, ans=0.125 2023-06-28 12:23:00,200 INFO [train.py:996] (3/4) Epoch 12, batch 7100, loss[loss=0.1658, simple_loss=0.2231, pruned_loss=0.05423, over 20820.00 frames. ], tot_loss[loss=0.2037, simple_loss=0.2809, pruned_loss=0.06321, over 4271239.80 frames. ], batch size: 608, lr: 2.45e-03, grad_scale: 16.0 2023-06-28 12:23:05,762 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=2055246.0, ans=0.125 2023-06-28 12:23:09,354 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=2055246.0, ans=0.125 2023-06-28 12:23:10,689 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=2055246.0, ans=0.0 2023-06-28 12:23:36,451 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.448e+02 7.505e+02 1.150e+03 1.796e+03 3.717e+03, threshold=2.300e+03, percent-clipped=14.0 2023-06-28 12:24:32,182 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=8.18 vs. limit=15.0 2023-06-28 12:24:42,354 INFO [train.py:996] (3/4) Epoch 12, batch 7150, loss[loss=0.2328, simple_loss=0.3053, pruned_loss=0.08015, over 21435.00 frames. ], tot_loss[loss=0.202, simple_loss=0.2798, pruned_loss=0.06205, over 4270487.02 frames. ], batch size: 194, lr: 2.45e-03, grad_scale: 16.0 2023-06-28 12:24:57,027 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.01 vs. limit=6.0 2023-06-28 12:25:38,027 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=2055666.0, ans=0.0 2023-06-28 12:25:49,432 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2055726.0, ans=0.1 2023-06-28 12:26:25,285 INFO [train.py:996] (3/4) Epoch 12, batch 7200, loss[loss=0.2006, simple_loss=0.2749, pruned_loss=0.0631, over 21717.00 frames. ], tot_loss[loss=0.2057, simple_loss=0.2827, pruned_loss=0.06435, over 4273760.40 frames. 
], batch size: 351, lr: 2.45e-03, grad_scale: 32.0 2023-06-28 12:26:52,595 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=2055906.0, ans=0.125 2023-06-28 12:26:57,774 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=2055906.0, ans=0.125 2023-06-28 12:27:08,265 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.252e+02 8.659e+02 1.185e+03 1.756e+03 3.819e+03, threshold=2.369e+03, percent-clipped=13.0 2023-06-28 12:27:08,938 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=2055966.0, ans=0.0 2023-06-28 12:27:40,525 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-28 12:27:58,482 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=2056086.0, ans=0.0 2023-06-28 12:28:01,441 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2056086.0, ans=0.1 2023-06-28 12:28:01,519 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=2056086.0, ans=0.0 2023-06-28 12:28:12,247 INFO [train.py:996] (3/4) Epoch 12, batch 7250, loss[loss=0.1984, simple_loss=0.2671, pruned_loss=0.06488, over 15730.00 frames. ], tot_loss[loss=0.2041, simple_loss=0.2793, pruned_loss=0.06448, over 4270390.14 frames. ], batch size: 66, lr: 2.45e-03, grad_scale: 16.0 2023-06-28 12:29:53,559 INFO [train.py:996] (3/4) Epoch 12, batch 7300, loss[loss=0.2253, simple_loss=0.2651, pruned_loss=0.09275, over 21500.00 frames. ], tot_loss[loss=0.2007, simple_loss=0.2735, pruned_loss=0.06396, over 4270525.12 frames. ], batch size: 512, lr: 2.45e-03, grad_scale: 16.0 2023-06-28 12:30:04,521 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.86 vs. limit=22.5 2023-06-28 12:30:12,046 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=2056506.0, ans=0.0 2023-06-28 12:30:15,887 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=2056506.0, ans=0.0 2023-06-28 12:30:31,644 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.740e+02 7.948e+02 1.183e+03 1.586e+03 3.750e+03, threshold=2.367e+03, percent-clipped=12.0 2023-06-28 12:30:35,806 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=2056566.0, ans=0.0 2023-06-28 12:30:57,118 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=2056626.0, ans=0.125 2023-06-28 12:31:29,881 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.83 vs. limit=15.0 2023-06-28 12:31:31,964 INFO [train.py:996] (3/4) Epoch 12, batch 7350, loss[loss=0.2177, simple_loss=0.2924, pruned_loss=0.07156, over 21689.00 frames. ], tot_loss[loss=0.2013, simple_loss=0.2729, pruned_loss=0.06485, over 4266537.77 frames. 
], batch size: 351, lr: 2.45e-03, grad_scale: 16.0 2023-06-28 12:31:39,915 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=2056746.0, ans=0.125 2023-06-28 12:32:46,345 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.49 vs. limit=15.0 2023-06-28 12:33:17,322 INFO [train.py:996] (3/4) Epoch 12, batch 7400, loss[loss=0.206, simple_loss=0.3006, pruned_loss=0.0557, over 21831.00 frames. ], tot_loss[loss=0.205, simple_loss=0.2793, pruned_loss=0.06533, over 4258529.77 frames. ], batch size: 372, lr: 2.45e-03, grad_scale: 16.0 2023-06-28 12:33:27,646 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=2057046.0, ans=0.125 2023-06-28 12:34:05,872 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.109e+02 7.290e+02 9.953e+02 1.415e+03 2.956e+03, threshold=1.991e+03, percent-clipped=1.0 2023-06-28 12:34:54,424 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=2057286.0, ans=0.125 2023-06-28 12:35:00,566 INFO [train.py:996] (3/4) Epoch 12, batch 7450, loss[loss=0.1807, simple_loss=0.2479, pruned_loss=0.05677, over 21526.00 frames. ], tot_loss[loss=0.2027, simple_loss=0.2782, pruned_loss=0.06357, over 4251623.77 frames. ], batch size: 263, lr: 2.45e-03, grad_scale: 16.0 2023-06-28 12:35:34,346 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=2057406.0, ans=0.125 2023-06-28 12:36:17,698 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=2057526.0, ans=0.0 2023-06-28 12:36:21,588 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.14 vs. limit=15.0 2023-06-28 12:36:49,960 INFO [train.py:996] (3/4) Epoch 12, batch 7500, loss[loss=0.2842, simple_loss=0.3859, pruned_loss=0.09122, over 21669.00 frames. ], tot_loss[loss=0.2075, simple_loss=0.2838, pruned_loss=0.06558, over 4257228.20 frames. ], batch size: 441, lr: 2.45e-03, grad_scale: 16.0 2023-06-28 12:37:17,493 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2057706.0, ans=0.1 2023-06-28 12:37:33,885 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.996e+02 7.365e+02 1.053e+03 1.699e+03 4.084e+03, threshold=2.105e+03, percent-clipped=21.0 2023-06-28 12:37:36,907 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.23 vs. limit=15.0 2023-06-28 12:38:17,187 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.63 vs. limit=15.0 2023-06-28 12:38:34,115 INFO [train.py:996] (3/4) Epoch 12, batch 7550, loss[loss=0.1889, simple_loss=0.2427, pruned_loss=0.06761, over 20324.00 frames. ], tot_loss[loss=0.2099, simple_loss=0.2899, pruned_loss=0.06498, over 4254097.52 frames. 
], batch size: 703, lr: 2.45e-03, grad_scale: 16.0 2023-06-28 12:39:14,957 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2058006.0, ans=0.1 2023-06-28 12:39:42,691 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=2058126.0, ans=0.0 2023-06-28 12:39:46,064 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-28 12:40:16,303 INFO [train.py:996] (3/4) Epoch 12, batch 7600, loss[loss=0.2456, simple_loss=0.3027, pruned_loss=0.09428, over 21771.00 frames. ], tot_loss[loss=0.209, simple_loss=0.289, pruned_loss=0.06451, over 4262578.33 frames. ], batch size: 507, lr: 2.45e-03, grad_scale: 32.0 2023-06-28 12:40:58,887 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.988e+02 7.986e+02 1.163e+03 1.762e+03 3.955e+03, threshold=2.326e+03, percent-clipped=12.0 2023-06-28 12:41:09,623 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.35 vs. limit=22.5 2023-06-28 12:41:57,638 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.08 vs. limit=12.0 2023-06-28 12:41:57,887 INFO [train.py:996] (3/4) Epoch 12, batch 7650, loss[loss=0.23, simple_loss=0.2955, pruned_loss=0.08222, over 21802.00 frames. ], tot_loss[loss=0.2098, simple_loss=0.2881, pruned_loss=0.06574, over 4269079.34 frames. ], batch size: 441, lr: 2.45e-03, grad_scale: 16.0 2023-06-28 12:42:01,730 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=2058546.0, ans=0.125 2023-06-28 12:42:10,326 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=2058546.0, ans=0.125 2023-06-28 12:42:57,208 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-28 12:43:02,140 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2058726.0, ans=0.1 2023-06-28 12:43:28,793 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=2058786.0, ans=0.025 2023-06-28 12:43:46,551 INFO [train.py:996] (3/4) Epoch 12, batch 7700, loss[loss=0.2573, simple_loss=0.3252, pruned_loss=0.09467, over 21835.00 frames. ], tot_loss[loss=0.2132, simple_loss=0.2903, pruned_loss=0.06803, over 4275082.18 frames. ], batch size: 441, lr: 2.45e-03, grad_scale: 16.0 2023-06-28 12:43:56,637 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=2058846.0, ans=0.125 2023-06-28 12:43:57,363 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=11.32 vs. 
limit=15.0 2023-06-28 12:44:31,977 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.981e+02 7.414e+02 1.157e+03 1.590e+03 5.387e+03, threshold=2.314e+03, percent-clipped=8.0 2023-06-28 12:45:03,484 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=2059026.0, ans=0.125 2023-06-28 12:45:08,784 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=2059086.0, ans=0.125 2023-06-28 12:45:33,911 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=2059086.0, ans=0.2 2023-06-28 12:45:36,620 INFO [train.py:996] (3/4) Epoch 12, batch 7750, loss[loss=0.2293, simple_loss=0.3287, pruned_loss=0.06493, over 21609.00 frames. ], tot_loss[loss=0.2143, simple_loss=0.2932, pruned_loss=0.06771, over 4272491.22 frames. ], batch size: 230, lr: 2.44e-03, grad_scale: 16.0 2023-06-28 12:45:43,914 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2059146.0, ans=0.1 2023-06-28 12:45:56,836 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=2059206.0, ans=0.125 2023-06-28 12:46:09,740 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=2059206.0, ans=0.125 2023-06-28 12:46:13,086 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=2059206.0, ans=0.125 2023-06-28 12:47:21,155 INFO [train.py:996] (3/4) Epoch 12, batch 7800, loss[loss=0.1804, simple_loss=0.2349, pruned_loss=0.06294, over 20823.00 frames. ], tot_loss[loss=0.2172, simple_loss=0.2969, pruned_loss=0.06872, over 4273254.40 frames. ], batch size: 609, lr: 2.44e-03, grad_scale: 16.0 2023-06-28 12:47:48,038 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.74 vs. limit=22.5 2023-06-28 12:47:49,059 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=2059506.0, ans=0.2 2023-06-28 12:48:00,037 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.542e+02 9.199e+02 1.440e+03 2.477e+03 5.669e+03, threshold=2.881e+03, percent-clipped=30.0 2023-06-28 12:48:16,159 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=2059566.0, ans=0.2 2023-06-28 12:49:00,676 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=2059686.0, ans=0.2 2023-06-28 12:49:03,642 INFO [train.py:996] (3/4) Epoch 12, batch 7850, loss[loss=0.1922, simple_loss=0.2582, pruned_loss=0.06316, over 21522.00 frames. ], tot_loss[loss=0.2129, simple_loss=0.29, pruned_loss=0.0679, over 4275318.23 frames. 
], batch size: 195, lr: 2.44e-03, grad_scale: 16.0 2023-06-28 12:49:16,395 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=2059746.0, ans=0.125 2023-06-28 12:49:34,616 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=2059806.0, ans=0.0 2023-06-28 12:49:38,661 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.51 vs. limit=15.0 2023-06-28 12:49:39,571 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=2059806.0, ans=0.125 2023-06-28 12:49:46,403 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=2059866.0, ans=10.0 2023-06-28 12:50:49,178 INFO [train.py:996] (3/4) Epoch 12, batch 7900, loss[loss=0.1881, simple_loss=0.2553, pruned_loss=0.06049, over 21140.00 frames. ], tot_loss[loss=0.2099, simple_loss=0.2858, pruned_loss=0.06704, over 4276063.50 frames. ], batch size: 143, lr: 2.44e-03, grad_scale: 16.0 2023-06-28 12:51:21,244 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=2060106.0, ans=0.125 2023-06-28 12:51:30,562 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.553e+02 9.216e+02 1.431e+03 2.035e+03 3.808e+03, threshold=2.862e+03, percent-clipped=8.0 2023-06-28 12:52:38,404 INFO [train.py:996] (3/4) Epoch 12, batch 7950, loss[loss=0.2106, simple_loss=0.3041, pruned_loss=0.05851, over 21787.00 frames. ], tot_loss[loss=0.2106, simple_loss=0.2888, pruned_loss=0.06621, over 4278335.70 frames. ], batch size: 332, lr: 2.44e-03, grad_scale: 16.0 2023-06-28 12:52:40,609 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=2060346.0, ans=0.125 2023-06-28 12:52:54,424 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=2060406.0, ans=0.0 2023-06-28 12:53:50,222 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2060526.0, ans=0.1 2023-06-28 12:53:57,769 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2060526.0, ans=0.1 2023-06-28 12:54:08,085 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.32 vs. limit=12.0 2023-06-28 12:54:24,569 INFO [train.py:996] (3/4) Epoch 12, batch 8000, loss[loss=0.2325, simple_loss=0.318, pruned_loss=0.07347, over 21902.00 frames. ], tot_loss[loss=0.2151, simple_loss=0.2935, pruned_loss=0.06832, over 4272452.52 frames. ], batch size: 316, lr: 2.44e-03, grad_scale: 32.0 2023-06-28 12:54:58,906 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=12.28 vs. limit=15.0 2023-06-28 12:55:11,048 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=8.66 vs. 
limit=15.0 2023-06-28 12:55:16,459 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.84 vs. limit=15.0 2023-06-28 12:55:18,764 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.842e+02 9.882e+02 1.672e+03 2.798e+03 5.114e+03, threshold=3.344e+03, percent-clipped=23.0 2023-06-28 12:55:34,805 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=2060766.0, ans=0.125 2023-06-28 12:55:48,480 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=2060826.0, ans=0.125 2023-06-28 12:56:00,988 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.72 vs. limit=12.0 2023-06-28 12:56:16,331 INFO [train.py:996] (3/4) Epoch 12, batch 8050, loss[loss=0.2013, simple_loss=0.2661, pruned_loss=0.06823, over 21496.00 frames. ], tot_loss[loss=0.2177, simple_loss=0.298, pruned_loss=0.06871, over 4267378.78 frames. ], batch size: 194, lr: 2.44e-03, grad_scale: 16.0 2023-06-28 12:56:30,972 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.89 vs. limit=15.0 2023-06-28 12:57:05,596 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=2061066.0, ans=0.125 2023-06-28 12:58:04,708 INFO [train.py:996] (3/4) Epoch 12, batch 8100, loss[loss=0.2337, simple_loss=0.303, pruned_loss=0.08214, over 21783.00 frames. ], tot_loss[loss=0.2175, simple_loss=0.2962, pruned_loss=0.06933, over 4268256.79 frames. ], batch size: 441, lr: 2.44e-03, grad_scale: 16.0 2023-06-28 12:58:05,398 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=2061246.0, ans=0.07 2023-06-28 12:58:35,567 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten.whitening_limit, batch_count=2061306.0, ans=22.5 2023-06-28 12:58:53,296 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.830e+02 7.832e+02 1.202e+03 2.450e+03 5.574e+03, threshold=2.405e+03, percent-clipped=10.0 2023-06-28 12:59:56,680 INFO [train.py:996] (3/4) Epoch 12, batch 8150, loss[loss=0.2057, simple_loss=0.281, pruned_loss=0.06522, over 21502.00 frames. ], tot_loss[loss=0.221, simple_loss=0.3018, pruned_loss=0.07012, over 4258934.97 frames. ], batch size: 195, lr: 2.44e-03, grad_scale: 16.0 2023-06-28 13:00:20,867 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=2061606.0, ans=0.0 2023-06-28 13:00:30,324 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=2061606.0, ans=0.125 2023-06-28 13:00:37,542 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.35 vs. limit=15.0 2023-06-28 13:01:14,688 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.69 vs. limit=15.0 2023-06-28 13:01:39,549 INFO [train.py:996] (3/4) Epoch 12, batch 8200, loss[loss=0.1994, simple_loss=0.2655, pruned_loss=0.06668, over 21805.00 frames. 
], tot_loss[loss=0.2152, simple_loss=0.2944, pruned_loss=0.06797, over 4265163.08 frames. ], batch size: 352, lr: 2.44e-03, grad_scale: 16.0 2023-06-28 13:01:45,392 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=2061846.0, ans=0.125 2023-06-28 13:02:20,346 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=2061966.0, ans=0.015 2023-06-28 13:02:20,481 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=2061966.0, ans=0.125 2023-06-28 13:02:21,471 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.733e+02 7.541e+02 1.166e+03 1.975e+03 4.840e+03, threshold=2.333e+03, percent-clipped=18.0 2023-06-28 13:02:25,579 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=2061966.0, ans=0.2 2023-06-28 13:02:53,031 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=5.72 vs. limit=15.0 2023-06-28 13:03:23,709 INFO [train.py:996] (3/4) Epoch 12, batch 8250, loss[loss=0.2064, simple_loss=0.286, pruned_loss=0.0634, over 21298.00 frames. ], tot_loss[loss=0.2141, simple_loss=0.2938, pruned_loss=0.06718, over 4270309.59 frames. ], batch size: 159, lr: 2.44e-03, grad_scale: 16.0 2023-06-28 13:03:24,331 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=2062146.0, ans=0.0 2023-06-28 13:03:52,892 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=2062206.0, ans=0.2 2023-06-28 13:04:24,759 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=2062266.0, ans=0.04949747468305833 2023-06-28 13:05:07,868 INFO [train.py:996] (3/4) Epoch 12, batch 8300, loss[loss=0.1725, simple_loss=0.251, pruned_loss=0.04705, over 21298.00 frames. ], tot_loss[loss=0.2115, simple_loss=0.2927, pruned_loss=0.06512, over 4269111.43 frames. ], batch size: 131, lr: 2.44e-03, grad_scale: 16.0 2023-06-28 13:05:49,521 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.960e+02 7.792e+02 1.211e+03 1.944e+03 6.178e+03, threshold=2.421e+03, percent-clipped=18.0 2023-06-28 13:06:10,402 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=8.60 vs. limit=15.0 2023-06-28 13:06:55,852 INFO [train.py:996] (3/4) Epoch 12, batch 8350, loss[loss=0.1985, simple_loss=0.2837, pruned_loss=0.05671, over 21234.00 frames. ], tot_loss[loss=0.2103, simple_loss=0.2929, pruned_loss=0.06384, over 4269049.98 frames. ], batch size: 159, lr: 2.44e-03, grad_scale: 16.0 2023-06-28 13:07:42,928 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.54 vs. limit=22.5 2023-06-28 13:08:39,736 INFO [train.py:996] (3/4) Epoch 12, batch 8400, loss[loss=0.1512, simple_loss=0.2329, pruned_loss=0.03475, over 21715.00 frames. ], tot_loss[loss=0.2059, simple_loss=0.2895, pruned_loss=0.06114, over 4270284.43 frames. 
], batch size: 124, lr: 2.44e-03, grad_scale: 32.0 2023-06-28 13:08:42,154 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=2063046.0, ans=0.025 2023-06-28 13:09:04,654 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=2063106.0, ans=0.0 2023-06-28 13:09:12,844 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer_ff3.min_abs, batch_count=2063166.0, ans=0.2 2023-06-28 13:09:20,749 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-28 13:09:21,786 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.051e+02 6.739e+02 1.036e+03 1.500e+03 3.619e+03, threshold=2.071e+03, percent-clipped=10.0 2023-06-28 13:09:42,681 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys.whitening_limit, batch_count=2063226.0, ans=6.0 2023-06-28 13:10:04,286 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.99 vs. limit=15.0 2023-06-28 13:10:21,248 INFO [train.py:996] (3/4) Epoch 12, batch 8450, loss[loss=0.1857, simple_loss=0.2648, pruned_loss=0.05335, over 21882.00 frames. ], tot_loss[loss=0.2051, simple_loss=0.2881, pruned_loss=0.06105, over 4271550.49 frames. ], batch size: 332, lr: 2.44e-03, grad_scale: 16.0 2023-06-28 13:10:28,461 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=2063346.0, ans=0.125 2023-06-28 13:10:31,871 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=2063346.0, ans=0.0 2023-06-28 13:12:04,194 INFO [train.py:996] (3/4) Epoch 12, batch 8500, loss[loss=0.2201, simple_loss=0.3145, pruned_loss=0.06281, over 20901.00 frames. ], tot_loss[loss=0.2042, simple_loss=0.2842, pruned_loss=0.06211, over 4272806.52 frames. ], batch size: 607, lr: 2.44e-03, grad_scale: 8.0 2023-06-28 13:12:24,943 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer_ff3.min_abs, batch_count=2063706.0, ans=0.2 2023-06-28 13:12:30,416 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=2063706.0, ans=0.04949747468305833 2023-06-28 13:12:49,779 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.731e+02 8.144e+02 1.139e+03 1.907e+03 5.140e+03, threshold=2.279e+03, percent-clipped=18.0 2023-06-28 13:13:23,823 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=2063826.0, ans=0.0 2023-06-28 13:13:48,462 INFO [train.py:996] (3/4) Epoch 12, batch 8550, loss[loss=0.2254, simple_loss=0.3217, pruned_loss=0.0645, over 21798.00 frames. ], tot_loss[loss=0.2084, simple_loss=0.2885, pruned_loss=0.06418, over 4274433.15 frames. 
], batch size: 316, lr: 2.44e-03, grad_scale: 8.0 2023-06-28 13:13:56,191 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=2063946.0, ans=0.0 2023-06-28 13:14:11,523 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=2064006.0, ans=0.0 2023-06-28 13:14:37,079 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=2064066.0, ans=0.125 2023-06-28 13:15:15,336 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=2064126.0, ans=0.125 2023-06-28 13:15:15,338 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=2064126.0, ans=0.015 2023-06-28 13:15:27,672 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.78 vs. limit=15.0 2023-06-28 13:15:34,968 INFO [train.py:996] (3/4) Epoch 12, batch 8600, loss[loss=0.2309, simple_loss=0.3121, pruned_loss=0.07486, over 21453.00 frames. ], tot_loss[loss=0.2141, simple_loss=0.2959, pruned_loss=0.06609, over 4279896.38 frames. ], batch size: 194, lr: 2.44e-03, grad_scale: 8.0 2023-06-28 13:16:28,795 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=2064366.0, ans=0.125 2023-06-28 13:16:29,846 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.562e+02 1.076e+03 1.611e+03 2.403e+03 4.318e+03, threshold=3.223e+03, percent-clipped=30.0 2023-06-28 13:16:58,861 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2064426.0, ans=0.1 2023-06-28 13:17:18,550 INFO [train.py:996] (3/4) Epoch 12, batch 8650, loss[loss=0.2056, simple_loss=0.3067, pruned_loss=0.05221, over 21692.00 frames. ], tot_loss[loss=0.2155, simple_loss=0.2991, pruned_loss=0.06592, over 4281818.12 frames. ], batch size: 389, lr: 2.44e-03, grad_scale: 8.0 2023-06-28 13:17:19,023 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=2064546.0, ans=0.125 2023-06-28 13:17:49,740 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=2064606.0, ans=0.125 2023-06-28 13:17:59,670 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=2064666.0, ans=0.0 2023-06-28 13:18:59,814 INFO [train.py:996] (3/4) Epoch 12, batch 8700, loss[loss=0.175, simple_loss=0.2471, pruned_loss=0.05142, over 21634.00 frames. ], tot_loss[loss=0.2081, simple_loss=0.2905, pruned_loss=0.06289, over 4286400.84 frames. 
], batch size: 298, lr: 2.44e-03, grad_scale: 8.0 2023-06-28 13:19:40,890 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=2064966.0, ans=0.125 2023-06-28 13:19:53,459 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.697e+02 7.863e+02 1.211e+03 1.985e+03 4.359e+03, threshold=2.422e+03, percent-clipped=4.0 2023-06-28 13:20:40,720 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=2065146.0, ans=0.125 2023-06-28 13:20:41,886 INFO [train.py:996] (3/4) Epoch 12, batch 8750, loss[loss=0.2106, simple_loss=0.2864, pruned_loss=0.06738, over 21929.00 frames. ], tot_loss[loss=0.2062, simple_loss=0.2852, pruned_loss=0.06358, over 4272293.67 frames. ], batch size: 333, lr: 2.44e-03, grad_scale: 8.0 2023-06-28 13:21:07,511 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=12.31 vs. limit=22.5 2023-06-28 13:21:15,226 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=2065206.0, ans=0.04949747468305833 2023-06-28 13:21:53,607 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.38 vs. limit=15.0 2023-06-28 13:22:03,350 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=2065326.0, ans=0.125 2023-06-28 13:22:13,414 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=2065386.0, ans=0.0 2023-06-28 13:22:13,430 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=2065386.0, ans=0.05 2023-06-28 13:22:31,073 INFO [train.py:996] (3/4) Epoch 12, batch 8800, loss[loss=0.2658, simple_loss=0.3483, pruned_loss=0.09162, over 21783.00 frames. ], tot_loss[loss=0.2126, simple_loss=0.2934, pruned_loss=0.06585, over 4276438.08 frames. ], batch size: 441, lr: 2.44e-03, grad_scale: 16.0 2023-06-28 13:22:33,634 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=2065446.0, ans=0.125 2023-06-28 13:22:34,392 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=9.00 vs. 
limit=15.0 2023-06-28 13:23:09,205 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2065506.0, ans=0.1 2023-06-28 13:23:26,838 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.165e+02 8.763e+02 1.222e+03 1.735e+03 3.559e+03, threshold=2.444e+03, percent-clipped=10.0 2023-06-28 13:23:30,684 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=2065566.0, ans=0.0 2023-06-28 13:23:42,464 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=2065626.0, ans=0.1 2023-06-28 13:23:47,820 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2065626.0, ans=0.1 2023-06-28 13:23:56,935 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=11.33 vs. limit=15.0 2023-06-28 13:24:16,136 INFO [train.py:996] (3/4) Epoch 12, batch 8850, loss[loss=0.2049, simple_loss=0.2866, pruned_loss=0.06157, over 21743.00 frames. ], tot_loss[loss=0.2182, simple_loss=0.3002, pruned_loss=0.06805, over 4278815.07 frames. ], batch size: 351, lr: 2.44e-03, grad_scale: 16.0 2023-06-28 13:24:23,933 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys.whitening_limit, batch_count=2065746.0, ans=6.0 2023-06-28 13:24:29,966 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=2065746.0, ans=0.125 2023-06-28 13:25:17,003 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.21 vs. limit=10.0 2023-06-28 13:25:23,406 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=2065926.0, ans=0.015 2023-06-28 13:25:50,380 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=2065986.0, ans=0.125 2023-06-28 13:26:05,234 INFO [train.py:996] (3/4) Epoch 12, batch 8900, loss[loss=0.1812, simple_loss=0.2547, pruned_loss=0.05384, over 21390.00 frames. ], tot_loss[loss=0.2148, simple_loss=0.2952, pruned_loss=0.06724, over 4265637.93 frames. ], batch size: 211, lr: 2.44e-03, grad_scale: 16.0 2023-06-28 13:26:26,572 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=2066106.0, ans=0.2 2023-06-28 13:26:50,882 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=2066166.0, ans=0.2 2023-06-28 13:26:54,470 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2066166.0, ans=0.1 2023-06-28 13:26:57,492 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.266e+02 7.347e+02 1.235e+03 1.790e+03 4.739e+03, threshold=2.470e+03, percent-clipped=10.0 2023-06-28 13:27:07,711 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.63 vs. 
limit=15.0 2023-06-28 13:27:25,671 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=2066226.0, ans=0.125 2023-06-28 13:27:42,520 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.70 vs. limit=15.0 2023-06-28 13:27:56,295 INFO [train.py:996] (3/4) Epoch 12, batch 8950, loss[loss=0.2359, simple_loss=0.3301, pruned_loss=0.07082, over 21648.00 frames. ], tot_loss[loss=0.2134, simple_loss=0.295, pruned_loss=0.06589, over 4251654.65 frames. ], batch size: 414, lr: 2.44e-03, grad_scale: 16.0 2023-06-28 13:28:10,666 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.92 vs. limit=15.0 2023-06-28 13:28:56,318 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=2066526.0, ans=0.2 2023-06-28 13:29:07,171 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.88 vs. limit=10.0 2023-06-28 13:29:38,958 INFO [train.py:996] (3/4) Epoch 12, batch 9000, loss[loss=0.1977, simple_loss=0.2588, pruned_loss=0.06824, over 21352.00 frames. ], tot_loss[loss=0.2093, simple_loss=0.2878, pruned_loss=0.06538, over 4259817.84 frames. ], batch size: 144, lr: 2.44e-03, grad_scale: 16.0 2023-06-28 13:29:38,958 INFO [train.py:1019] (3/4) Computing validation loss 2023-06-28 13:29:52,349 INFO [zipformer.py:1728] (3/4) name=encoder.encoders.0.layers.1.self_attn_weights, attn_weights_entropy = tensor([4.8747, 4.3675, 4.5826, 4.1051], device='cuda:3') 2023-06-28 13:29:59,528 INFO [train.py:1028] (3/4) Epoch 12, validation: loss=0.2628, simple_loss=0.3535, pruned_loss=0.086, over 1796401.00 frames. 2023-06-28 13:29:59,529 INFO [train.py:1029] (3/4) Maximum memory allocated so far is 23690MB 2023-06-28 13:30:05,395 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2066646.0, ans=0.1 2023-06-28 13:30:05,478 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=2066646.0, ans=0.0 2023-06-28 13:30:37,800 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.94 vs. limit=12.0 2023-06-28 13:30:44,984 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.661e+02 7.055e+02 9.403e+02 1.588e+03 4.919e+03, threshold=1.881e+03, percent-clipped=11.0 2023-06-28 13:31:26,741 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=2066886.0, ans=0.05 2023-06-28 13:31:34,742 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=2066886.0, ans=0.125 2023-06-28 13:31:41,537 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=2066886.0, ans=0.0 2023-06-28 13:31:44,375 INFO [train.py:996] (3/4) Epoch 12, batch 9050, loss[loss=0.2153, simple_loss=0.2973, pruned_loss=0.06666, over 21786.00 frames. ], tot_loss[loss=0.2037, simple_loss=0.283, pruned_loss=0.06219, over 4256307.60 frames. 
], batch size: 332, lr: 2.44e-03, grad_scale: 16.0 2023-06-28 13:31:55,459 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=2066946.0, ans=0.125 2023-06-28 13:32:02,310 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=2067006.0, ans=0.0 2023-06-28 13:32:14,582 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=2067006.0, ans=0.2 2023-06-28 13:33:03,750 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2067126.0, ans=0.1 2023-06-28 13:33:30,607 INFO [train.py:996] (3/4) Epoch 12, batch 9100, loss[loss=0.1895, simple_loss=0.2913, pruned_loss=0.04382, over 21757.00 frames. ], tot_loss[loss=0.2089, simple_loss=0.2886, pruned_loss=0.06461, over 4259827.07 frames. ], batch size: 282, lr: 2.44e-03, grad_scale: 16.0 2023-06-28 13:34:19,411 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=2067366.0, ans=0.0 2023-06-28 13:34:22,065 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.166e+02 1.280e+03 2.185e+03 3.198e+03 4.785e+03, threshold=4.371e+03, percent-clipped=55.0 2023-06-28 13:35:16,204 INFO [train.py:996] (3/4) Epoch 12, batch 9150, loss[loss=0.2056, simple_loss=0.2988, pruned_loss=0.0562, over 21663.00 frames. ], tot_loss[loss=0.2103, simple_loss=0.2937, pruned_loss=0.06343, over 4249897.42 frames. ], batch size: 263, lr: 2.44e-03, grad_scale: 16.0 2023-06-28 13:35:32,647 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.04 vs. limit=6.0 2023-06-28 13:36:32,217 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.34 vs. limit=15.0 2023-06-28 13:36:59,436 INFO [train.py:996] (3/4) Epoch 12, batch 9200, loss[loss=0.2186, simple_loss=0.3051, pruned_loss=0.06603, over 19858.00 frames. ], tot_loss[loss=0.2094, simple_loss=0.295, pruned_loss=0.06191, over 4258758.41 frames. ], batch size: 703, lr: 2.44e-03, grad_scale: 32.0 2023-06-28 13:37:23,819 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=2067846.0, ans=0.125 2023-06-28 13:37:40,519 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=2067906.0, ans=0.125 2023-06-28 13:38:01,133 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.761e+02 9.017e+02 1.569e+03 2.101e+03 3.767e+03, threshold=3.138e+03, percent-clipped=0.0 2023-06-28 13:38:10,622 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=10.05 vs. limit=15.0 2023-06-28 13:38:37,963 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.87 vs. limit=10.0 2023-06-28 13:38:48,657 INFO [train.py:996] (3/4) Epoch 12, batch 9250, loss[loss=0.261, simple_loss=0.3332, pruned_loss=0.09437, over 21401.00 frames. ], tot_loss[loss=0.2136, simple_loss=0.2974, pruned_loss=0.06494, over 4254807.80 frames. 
], batch size: 471, lr: 2.44e-03, grad_scale: 16.0 2023-06-28 13:39:23,805 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-28 13:39:50,723 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2068266.0, ans=0.1 2023-06-28 13:39:58,542 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.14 vs. limit=6.0 2023-06-28 13:40:39,787 INFO [train.py:996] (3/4) Epoch 12, batch 9300, loss[loss=0.2268, simple_loss=0.3228, pruned_loss=0.06544, over 21652.00 frames. ], tot_loss[loss=0.2116, simple_loss=0.294, pruned_loss=0.06456, over 4253368.53 frames. ], batch size: 263, lr: 2.44e-03, grad_scale: 16.0 2023-06-28 13:40:45,934 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.14 vs. limit=12.0 2023-06-28 13:41:32,623 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.166e+02 1.033e+03 1.685e+03 2.661e+03 5.053e+03, threshold=3.371e+03, percent-clipped=15.0 2023-06-28 13:41:41,851 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=2068626.0, ans=0.0 2023-06-28 13:41:57,360 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=2068686.0, ans=0.2 2023-06-28 13:42:25,449 INFO [train.py:996] (3/4) Epoch 12, batch 9350, loss[loss=0.2607, simple_loss=0.3381, pruned_loss=0.09168, over 21454.00 frames. ], tot_loss[loss=0.2161, simple_loss=0.3001, pruned_loss=0.06609, over 4254839.46 frames. ], batch size: 471, lr: 2.44e-03, grad_scale: 16.0 2023-06-28 13:43:03,996 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.37 vs. limit=12.0 2023-06-28 13:43:34,044 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=2068926.0, ans=0.125 2023-06-28 13:44:15,561 INFO [train.py:996] (3/4) Epoch 12, batch 9400, loss[loss=0.2261, simple_loss=0.3457, pruned_loss=0.05327, over 19759.00 frames. ], tot_loss[loss=0.2181, simple_loss=0.3021, pruned_loss=0.06702, over 4253757.01 frames. ], batch size: 702, lr: 2.44e-03, grad_scale: 16.0 2023-06-28 13:44:16,413 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=2069046.0, ans=0.0 2023-06-28 13:44:28,059 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.31 vs. 
limit=10.0 2023-06-28 13:44:55,555 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2069166.0, ans=0.1 2023-06-28 13:45:01,444 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.886e+02 7.931e+02 1.125e+03 1.716e+03 3.605e+03, threshold=2.249e+03, percent-clipped=1.0 2023-06-28 13:45:02,343 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=2069166.0, ans=0.2 2023-06-28 13:45:07,384 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=2069166.0, ans=0.0 2023-06-28 13:45:21,761 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2069226.0, ans=0.1 2023-06-28 13:45:23,498 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=2069226.0, ans=0.04949747468305833 2023-06-28 13:45:33,959 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-28 13:45:36,254 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.73 vs. limit=15.0 2023-06-28 13:45:54,413 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.77 vs. limit=22.5 2023-06-28 13:45:55,504 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=2069286.0, ans=0.0 2023-06-28 13:45:58,261 INFO [train.py:996] (3/4) Epoch 12, batch 9450, loss[loss=0.1943, simple_loss=0.2622, pruned_loss=0.0632, over 21522.00 frames. ], tot_loss[loss=0.2132, simple_loss=0.2952, pruned_loss=0.06563, over 4249010.97 frames. ], batch size: 391, lr: 2.44e-03, grad_scale: 16.0 2023-06-28 13:46:13,671 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2069346.0, ans=0.1 2023-06-28 13:46:34,399 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.21 vs. limit=10.0 2023-06-28 13:47:41,509 INFO [train.py:996] (3/4) Epoch 12, batch 9500, loss[loss=0.1713, simple_loss=0.2465, pruned_loss=0.04807, over 21611.00 frames. ], tot_loss[loss=0.2095, simple_loss=0.2899, pruned_loss=0.06452, over 4245637.12 frames. ], batch size: 247, lr: 2.44e-03, grad_scale: 16.0 2023-06-28 13:48:38,315 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.464e+02 8.117e+02 1.177e+03 1.570e+03 4.123e+03, threshold=2.354e+03, percent-clipped=16.0 2023-06-28 13:48:53,717 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=2069826.0, ans=0.125 2023-06-28 13:48:54,193 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.17 vs. limit=15.0 2023-06-28 13:49:24,098 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2069946.0, ans=0.1 2023-06-28 13:49:25,115 INFO [train.py:996] (3/4) Epoch 12, batch 9550, loss[loss=0.2423, simple_loss=0.315, pruned_loss=0.08479, over 21185.00 frames. 
], tot_loss[loss=0.2114, simple_loss=0.2912, pruned_loss=0.06577, over 4253492.11 frames. ], batch size: 143, lr: 2.44e-03, grad_scale: 16.0 2023-06-28 13:49:35,878 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-28 13:49:42,411 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=2069946.0, ans=0.125 2023-06-28 13:50:11,377 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.61 vs. limit=22.5 2023-06-28 13:50:22,844 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=2070066.0, ans=0.125 2023-06-28 13:50:51,906 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.35 vs. limit=15.0 2023-06-28 13:51:04,173 INFO [train.py:996] (3/4) Epoch 12, batch 9600, loss[loss=0.2713, simple_loss=0.3453, pruned_loss=0.09864, over 21505.00 frames. ], tot_loss[loss=0.2156, simple_loss=0.2953, pruned_loss=0.06795, over 4262648.91 frames. ], batch size: 471, lr: 2.44e-03, grad_scale: 32.0 2023-06-28 13:51:21,221 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2070246.0, ans=0.1 2023-06-28 13:51:32,590 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=2070306.0, ans=0.07 2023-06-28 13:52:01,560 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.070e+02 8.059e+02 1.139e+03 1.979e+03 4.989e+03, threshold=2.277e+03, percent-clipped=18.0 2023-06-28 13:52:18,937 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=2070426.0, ans=0.125 2023-06-28 13:52:52,011 INFO [train.py:996] (3/4) Epoch 12, batch 9650, loss[loss=0.2916, simple_loss=0.3479, pruned_loss=0.1177, over 21388.00 frames. ], tot_loss[loss=0.2158, simple_loss=0.2952, pruned_loss=0.06819, over 4271893.71 frames. ], batch size: 508, lr: 2.44e-03, grad_scale: 16.0 2023-06-28 13:53:52,579 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=2070726.0, ans=0.125 2023-06-28 13:53:54,549 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=2070726.0, ans=0.2 2023-06-28 13:54:02,918 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=2070726.0, ans=0.125 2023-06-28 13:54:13,482 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=2070786.0, ans=0.125 2023-06-28 13:54:30,932 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.29 vs. limit=15.0 2023-06-28 13:54:36,709 INFO [train.py:996] (3/4) Epoch 12, batch 9700, loss[loss=0.1938, simple_loss=0.2786, pruned_loss=0.05456, over 21656.00 frames. ], tot_loss[loss=0.2182, simple_loss=0.2986, pruned_loss=0.0689, over 4266615.15 frames. 
], batch size: 230, lr: 2.44e-03, grad_scale: 16.0 2023-06-28 13:54:42,363 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=2070846.0, ans=0.125 2023-06-28 13:55:22,460 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=2070966.0, ans=0.125 2023-06-28 13:55:22,473 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=2070966.0, ans=0.125 2023-06-28 13:55:29,990 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.817e+02 8.034e+02 1.157e+03 1.856e+03 3.207e+03, threshold=2.314e+03, percent-clipped=13.0 2023-06-28 13:55:37,301 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=2071026.0, ans=0.0 2023-06-28 13:55:57,799 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.69 vs. limit=10.0 2023-06-28 13:56:19,103 INFO [train.py:996] (3/4) Epoch 12, batch 9750, loss[loss=0.2011, simple_loss=0.267, pruned_loss=0.06764, over 21405.00 frames. ], tot_loss[loss=0.2149, simple_loss=0.2939, pruned_loss=0.06789, over 4270535.38 frames. ], batch size: 389, lr: 2.44e-03, grad_scale: 16.0 2023-06-28 13:57:24,434 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=2071326.0, ans=0.0 2023-06-28 13:57:56,318 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=11.60 vs. limit=15.0 2023-06-28 13:57:56,318 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten.whitening_limit, batch_count=2071386.0, ans=15.0 2023-06-28 13:58:01,398 INFO [train.py:996] (3/4) Epoch 12, batch 9800, loss[loss=0.2115, simple_loss=0.2873, pruned_loss=0.06785, over 21746.00 frames. ], tot_loss[loss=0.2134, simple_loss=0.2915, pruned_loss=0.06763, over 4261477.06 frames. ], batch size: 389, lr: 2.44e-03, grad_scale: 16.0 2023-06-28 13:58:23,833 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2071506.0, ans=0.1 2023-06-28 13:58:33,459 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=2071506.0, ans=0.125 2023-06-28 13:58:53,336 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2071566.0, ans=0.1 2023-06-28 13:58:54,274 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.350e+02 9.272e+02 1.641e+03 2.423e+03 5.120e+03, threshold=3.282e+03, percent-clipped=25.0 2023-06-28 13:58:58,340 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=2071566.0, ans=10.0 2023-06-28 13:59:37,409 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=2071686.0, ans=0.0 2023-06-28 13:59:39,322 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=2071686.0, ans=0.125 2023-06-28 13:59:43,788 INFO [train.py:996] (3/4) Epoch 12, batch 9850, loss[loss=0.2188, simple_loss=0.3105, pruned_loss=0.0635, over 21404.00 frames. 
], tot_loss[loss=0.211, simple_loss=0.2881, pruned_loss=0.06699, over 4262615.96 frames. ], batch size: 131, lr: 2.44e-03, grad_scale: 16.0 2023-06-28 14:00:02,490 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=2071806.0, ans=0.125 2023-06-28 14:00:38,053 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2071866.0, ans=0.1 2023-06-28 14:01:03,309 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=2071986.0, ans=0.125 2023-06-28 14:01:25,449 INFO [train.py:996] (3/4) Epoch 12, batch 9900, loss[loss=0.2283, simple_loss=0.3033, pruned_loss=0.07661, over 21804.00 frames. ], tot_loss[loss=0.2093, simple_loss=0.2858, pruned_loss=0.06639, over 4259071.67 frames. ], batch size: 282, lr: 2.44e-03, grad_scale: 16.0 2023-06-28 14:01:31,140 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=2072046.0, ans=0.125 2023-06-28 14:01:32,738 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=2072046.0, ans=0.125 2023-06-28 14:02:08,417 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=2072166.0, ans=0.015 2023-06-28 14:02:08,418 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=2072166.0, ans=0.125 2023-06-28 14:02:17,029 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=2072166.0, ans=0.025 2023-06-28 14:02:19,780 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.295e+02 1.063e+03 1.503e+03 2.102e+03 4.753e+03, threshold=3.006e+03, percent-clipped=10.0 2023-06-28 14:03:05,236 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=2072286.0, ans=0.0 2023-06-28 14:03:05,667 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.55 vs. limit=15.0 2023-06-28 14:03:09,588 INFO [train.py:996] (3/4) Epoch 12, batch 9950, loss[loss=0.1898, simple_loss=0.2549, pruned_loss=0.06232, over 21595.00 frames. ], tot_loss[loss=0.2107, simple_loss=0.2865, pruned_loss=0.06745, over 4249400.69 frames. ], batch size: 263, lr: 2.44e-03, grad_scale: 16.0 2023-06-28 14:04:10,473 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=2072526.0, ans=0.0 2023-06-28 14:04:52,800 INFO [train.py:996] (3/4) Epoch 12, batch 10000, loss[loss=0.2051, simple_loss=0.2785, pruned_loss=0.06588, over 21837.00 frames. ], tot_loss[loss=0.209, simple_loss=0.2835, pruned_loss=0.06729, over 4253139.80 frames. ], batch size: 124, lr: 2.44e-03, grad_scale: 32.0 2023-06-28 14:04:53,503 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=2072646.0, ans=0.0 2023-06-28 14:04:57,336 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=13.22 vs. 
limit=22.5 2023-06-28 14:05:32,957 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=2072706.0, ans=0.125 2023-06-28 14:05:50,331 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.893e+02 6.803e+02 1.015e+03 1.604e+03 3.420e+03, threshold=2.029e+03, percent-clipped=1.0 2023-06-28 14:06:02,814 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=2072826.0, ans=0.0 2023-06-28 14:06:36,066 INFO [train.py:996] (3/4) Epoch 12, batch 10050, loss[loss=0.1966, simple_loss=0.2721, pruned_loss=0.06055, over 21543.00 frames. ], tot_loss[loss=0.2097, simple_loss=0.2841, pruned_loss=0.06767, over 4261068.67 frames. ], batch size: 389, lr: 2.44e-03, grad_scale: 16.0 2023-06-28 14:06:40,018 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2072946.0, ans=0.1 2023-06-28 14:06:56,896 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=2073006.0, ans=0.125 2023-06-28 14:07:19,167 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2073066.0, ans=0.1 2023-06-28 14:07:21,429 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.60 vs. limit=22.5 2023-06-28 14:07:31,132 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=2073066.0, ans=0.125 2023-06-28 14:07:49,708 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=2073126.0, ans=0.125 2023-06-28 14:08:21,350 INFO [train.py:996] (3/4) Epoch 12, batch 10100, loss[loss=0.2489, simple_loss=0.3598, pruned_loss=0.06897, over 19860.00 frames. ], tot_loss[loss=0.2079, simple_loss=0.2837, pruned_loss=0.06606, over 4260185.62 frames. ], batch size: 703, lr: 2.44e-03, grad_scale: 16.0 2023-06-28 14:08:32,277 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.11 vs. limit=12.0 2023-06-28 14:08:35,878 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.10 vs. limit=12.0 2023-06-28 14:08:55,124 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2073306.0, ans=0.1 2023-06-28 14:09:09,259 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=15.88 vs. limit=15.0 2023-06-28 14:09:14,209 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.39 vs. 
limit=15.0 2023-06-28 14:09:21,190 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.696e+02 9.806e+02 1.615e+03 2.401e+03 4.786e+03, threshold=3.230e+03, percent-clipped=36.0 2023-06-28 14:09:21,769 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=2073366.0, ans=0.0 2023-06-28 14:09:25,377 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=2073426.0, ans=0.0 2023-06-28 14:10:06,040 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-28 14:10:10,004 INFO [train.py:996] (3/4) Epoch 12, batch 10150, loss[loss=0.2164, simple_loss=0.2971, pruned_loss=0.06782, over 21819.00 frames. ], tot_loss[loss=0.2116, simple_loss=0.2877, pruned_loss=0.06776, over 4269533.11 frames. ], batch size: 118, lr: 2.44e-03, grad_scale: 16.0 2023-06-28 14:10:45,125 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=2073606.0, ans=0.2 2023-06-28 14:10:55,527 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=2073666.0, ans=0.125 2023-06-28 14:11:22,299 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=2073726.0, ans=0.125 2023-06-28 14:11:34,038 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=2073786.0, ans=0.125 2023-06-28 14:11:35,556 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=2073786.0, ans=0.0 2023-06-28 14:11:37,320 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=2073786.0, ans=0.0 2023-06-28 14:11:52,838 INFO [train.py:996] (3/4) Epoch 12, batch 10200, loss[loss=0.1846, simple_loss=0.2772, pruned_loss=0.04594, over 21724.00 frames. ], tot_loss[loss=0.2114, simple_loss=0.2887, pruned_loss=0.06702, over 4262781.79 frames. ], batch size: 352, lr: 2.44e-03, grad_scale: 16.0 2023-06-28 14:11:56,858 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=2073846.0, ans=0.2 2023-06-28 14:11:56,947 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=2073846.0, ans=0.2 2023-06-28 14:12:45,581 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.78 vs. 
limit=15.0 2023-06-28 14:12:46,477 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=2073966.0, ans=0.035 2023-06-28 14:12:47,765 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.411e+02 8.616e+02 1.269e+03 2.043e+03 3.610e+03, threshold=2.539e+03, percent-clipped=1.0 2023-06-28 14:13:06,567 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=2074026.0, ans=0.0 2023-06-28 14:13:13,398 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=2074086.0, ans=0.125 2023-06-28 14:13:23,199 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-28 14:13:40,917 INFO [train.py:996] (3/4) Epoch 12, batch 10250, loss[loss=0.1361, simple_loss=0.2226, pruned_loss=0.02484, over 21402.00 frames. ], tot_loss[loss=0.2039, simple_loss=0.2836, pruned_loss=0.06212, over 4258194.89 frames. ], batch size: 194, lr: 2.44e-03, grad_scale: 16.0 2023-06-28 14:14:24,136 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=4.47 vs. limit=15.0 2023-06-28 14:14:49,033 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=2074326.0, ans=0.0 2023-06-28 14:15:22,077 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=2074386.0, ans=0.0 2023-06-28 14:15:25,088 INFO [train.py:996] (3/4) Epoch 12, batch 10300, loss[loss=0.1966, simple_loss=0.2474, pruned_loss=0.07288, over 20331.00 frames. ], tot_loss[loss=0.2054, simple_loss=0.2849, pruned_loss=0.06299, over 4258195.50 frames. ], batch size: 703, lr: 2.44e-03, grad_scale: 16.0 2023-06-28 14:15:32,717 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=2074446.0, ans=0.125 2023-06-28 14:15:32,761 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-28 14:16:02,190 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=2074506.0, ans=0.2 2023-06-28 14:16:22,427 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 3.719e+02 6.981e+02 1.162e+03 1.847e+03 5.403e+03, threshold=2.324e+03, percent-clipped=10.0 2023-06-28 14:16:26,798 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=2074626.0, ans=0.125 2023-06-28 14:17:11,844 INFO [train.py:996] (3/4) Epoch 12, batch 10350, loss[loss=0.1889, simple_loss=0.2681, pruned_loss=0.05484, over 21734.00 frames. ], tot_loss[loss=0.2068, simple_loss=0.2875, pruned_loss=0.06308, over 4258291.40 frames. 
], batch size: 282, lr: 2.44e-03, grad_scale: 16.0 2023-06-28 14:17:47,799 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=2074806.0, ans=0.0 2023-06-28 14:17:54,665 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=2074866.0, ans=0.04949747468305833 2023-06-28 14:18:23,016 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2074926.0, ans=0.1 2023-06-28 14:18:38,168 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2074986.0, ans=0.1 2023-06-28 14:18:43,138 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=2074986.0, ans=0.0 2023-06-28 14:19:00,728 INFO [train.py:996] (3/4) Epoch 12, batch 10400, loss[loss=0.2023, simple_loss=0.2856, pruned_loss=0.05957, over 21921.00 frames. ], tot_loss[loss=0.2039, simple_loss=0.2829, pruned_loss=0.06243, over 4267830.54 frames. ], batch size: 373, lr: 2.44e-03, grad_scale: 32.0 2023-06-28 14:19:06,888 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=2075046.0, ans=0.0 2023-06-28 14:19:15,336 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=2075046.0, ans=0.2 2023-06-28 14:19:29,993 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=2075106.0, ans=0.125 2023-06-28 14:19:30,048 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=2075106.0, ans=10.0 2023-06-28 14:19:35,513 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=2075106.0, ans=0.125 2023-06-28 14:19:58,464 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.939e+02 1.030e+03 1.665e+03 2.817e+03 5.984e+03, threshold=3.330e+03, percent-clipped=36.0 2023-06-28 14:20:00,925 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=2075226.0, ans=0.0 2023-06-28 14:20:41,719 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=2075286.0, ans=0.2 2023-06-28 14:20:46,347 INFO [train.py:996] (3/4) Epoch 12, batch 10450, loss[loss=0.2494, simple_loss=0.338, pruned_loss=0.08039, over 21644.00 frames. ], tot_loss[loss=0.2097, simple_loss=0.2877, pruned_loss=0.06584, over 4274131.49 frames. ], batch size: 414, lr: 2.44e-03, grad_scale: 16.0 2023-06-28 14:20:58,602 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=2075346.0, ans=0.035 2023-06-28 14:21:18,757 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=2075406.0, ans=0.2 2023-06-28 14:21:48,998 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=9.50 vs. 
limit=15.0 2023-06-28 14:22:22,250 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=2075586.0, ans=0.125 2023-06-28 14:22:34,289 INFO [train.py:996] (3/4) Epoch 12, batch 10500, loss[loss=0.2369, simple_loss=0.2867, pruned_loss=0.09349, over 21407.00 frames. ], tot_loss[loss=0.2079, simple_loss=0.2871, pruned_loss=0.06441, over 4266095.23 frames. ], batch size: 508, lr: 2.44e-03, grad_scale: 16.0 2023-06-28 14:22:48,076 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=2075646.0, ans=0.125 2023-06-28 14:23:07,875 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.min_positive, batch_count=2075706.0, ans=0.05 2023-06-28 14:23:29,363 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=2075766.0, ans=0.0 2023-06-28 14:23:30,262 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.346e+02 7.811e+02 1.278e+03 1.903e+03 4.033e+03, threshold=2.556e+03, percent-clipped=2.0 2023-06-28 14:23:32,715 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2075826.0, ans=0.1 2023-06-28 14:24:16,715 INFO [train.py:996] (3/4) Epoch 12, batch 10550, loss[loss=0.1739, simple_loss=0.2394, pruned_loss=0.05422, over 21480.00 frames. ], tot_loss[loss=0.205, simple_loss=0.2827, pruned_loss=0.06372, over 4264547.43 frames. ], batch size: 195, lr: 2.44e-03, grad_scale: 16.0 2023-06-28 14:25:59,630 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2076246.0, ans=0.1 2023-06-28 14:26:00,758 INFO [train.py:996] (3/4) Epoch 12, batch 10600, loss[loss=0.19, simple_loss=0.2834, pruned_loss=0.04835, over 21686.00 frames. ], tot_loss[loss=0.1999, simple_loss=0.2769, pruned_loss=0.0615, over 4268077.97 frames. ], batch size: 298, lr: 2.43e-03, grad_scale: 16.0 2023-06-28 14:26:10,396 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=2076246.0, ans=0.125 2023-06-28 14:26:12,240 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2076246.0, ans=0.1 2023-06-28 14:26:59,364 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.873e+02 6.316e+02 8.507e+02 1.506e+03 2.988e+03, threshold=1.701e+03, percent-clipped=6.0 2023-06-28 14:27:31,380 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=2076486.0, ans=0.125 2023-06-28 14:27:37,816 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=2076486.0, ans=0.125 2023-06-28 14:27:40,230 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=9.65 vs. limit=15.0 2023-06-28 14:27:46,051 INFO [train.py:996] (3/4) Epoch 12, batch 10650, loss[loss=0.2074, simple_loss=0.2892, pruned_loss=0.06281, over 21586.00 frames. ], tot_loss[loss=0.2001, simple_loss=0.2787, pruned_loss=0.06071, over 4265751.00 frames. 
], batch size: 441, lr: 2.43e-03, grad_scale: 16.0 2023-06-28 14:28:31,489 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2076666.0, ans=0.1 2023-06-28 14:29:11,790 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=2076786.0, ans=0.125 2023-06-28 14:29:29,914 INFO [train.py:996] (3/4) Epoch 12, batch 10700, loss[loss=0.2153, simple_loss=0.2944, pruned_loss=0.06813, over 21498.00 frames. ], tot_loss[loss=0.1997, simple_loss=0.2773, pruned_loss=0.06104, over 4265021.87 frames. ], batch size: 194, lr: 2.43e-03, grad_scale: 16.0 2023-06-28 14:30:32,558 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.676e+02 7.956e+02 1.277e+03 1.864e+03 4.109e+03, threshold=2.555e+03, percent-clipped=30.0 2023-06-28 14:31:09,425 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.32 vs. limit=15.0 2023-06-28 14:31:21,313 INFO [train.py:996] (3/4) Epoch 12, batch 10750, loss[loss=0.1851, simple_loss=0.2571, pruned_loss=0.0566, over 20770.00 frames. ], tot_loss[loss=0.2075, simple_loss=0.2866, pruned_loss=0.06421, over 4265204.94 frames. ], batch size: 608, lr: 2.43e-03, grad_scale: 16.0 2023-06-28 14:32:03,924 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=2077266.0, ans=0.0 2023-06-28 14:33:10,858 INFO [train.py:996] (3/4) Epoch 12, batch 10800, loss[loss=0.2115, simple_loss=0.3291, pruned_loss=0.04698, over 20721.00 frames. ], tot_loss[loss=0.2104, simple_loss=0.2906, pruned_loss=0.06507, over 4260066.51 frames. ], batch size: 607, lr: 2.43e-03, grad_scale: 32.0 2023-06-28 14:33:39,817 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2077506.0, ans=0.1 2023-06-28 14:34:08,050 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.630e+02 8.272e+02 1.352e+03 2.286e+03 6.133e+03, threshold=2.703e+03, percent-clipped=22.0 2023-06-28 14:34:34,218 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=10.61 vs. limit=15.0 2023-06-28 14:34:54,574 INFO [train.py:996] (3/4) Epoch 12, batch 10850, loss[loss=0.1754, simple_loss=0.251, pruned_loss=0.0499, over 15490.00 frames. ], tot_loss[loss=0.2117, simple_loss=0.2923, pruned_loss=0.06558, over 4249782.26 frames. ], batch size: 61, lr: 2.43e-03, grad_scale: 16.0 2023-06-28 14:35:15,209 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.88 vs. 
limit=15.0 2023-06-28 14:35:37,839 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=2077866.0, ans=0.125 2023-06-28 14:35:47,025 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=2077866.0, ans=0.0 2023-06-28 14:35:52,107 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer_ff2.min_abs, batch_count=2077866.0, ans=0.1 2023-06-28 14:35:55,469 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=2077926.0, ans=0.0 2023-06-28 14:36:10,986 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.41 vs. limit=15.0 2023-06-28 14:36:28,158 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=7.20 vs. limit=10.0 2023-06-28 14:36:38,562 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.87 vs. limit=22.5 2023-06-28 14:36:38,970 INFO [train.py:996] (3/4) Epoch 12, batch 10900, loss[loss=0.1705, simple_loss=0.2435, pruned_loss=0.04878, over 21370.00 frames. ], tot_loss[loss=0.2073, simple_loss=0.2858, pruned_loss=0.06443, over 4250721.56 frames. ], batch size: 144, lr: 2.43e-03, grad_scale: 16.0 2023-06-28 14:37:12,165 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2078106.0, ans=0.1 2023-06-28 14:37:34,931 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=2078166.0, ans=0.2 2023-06-28 14:37:36,121 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.311e+02 7.551e+02 9.581e+02 1.368e+03 2.722e+03, threshold=1.916e+03, percent-clipped=1.0 2023-06-28 14:38:20,663 INFO [train.py:996] (3/4) Epoch 12, batch 10950, loss[loss=0.2373, simple_loss=0.3333, pruned_loss=0.07071, over 20847.00 frames. ], tot_loss[loss=0.2041, simple_loss=0.2826, pruned_loss=0.0628, over 4254488.70 frames. ], batch size: 608, lr: 2.43e-03, grad_scale: 16.0 2023-06-28 14:38:22,725 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=2078346.0, ans=0.0 2023-06-28 14:38:26,229 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2078346.0, ans=0.1 2023-06-28 14:39:08,328 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=2078466.0, ans=0.125 2023-06-28 14:39:12,285 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=2.91 vs. limit=6.0 2023-06-28 14:39:21,941 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=2078526.0, ans=0.125 2023-06-28 14:39:33,204 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=2078526.0, ans=0.125 2023-06-28 14:40:04,308 INFO [train.py:996] (3/4) Epoch 12, batch 11000, loss[loss=0.2078, simple_loss=0.28, pruned_loss=0.06779, over 21561.00 frames. ], tot_loss[loss=0.2039, simple_loss=0.2822, pruned_loss=0.06285, over 4254004.76 frames. 
], batch size: 212, lr: 2.43e-03, grad_scale: 16.0 2023-06-28 14:40:35,133 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.57 vs. limit=15.0 2023-06-28 14:41:02,130 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.714e+02 8.398e+02 1.287e+03 1.832e+03 5.305e+03, threshold=2.574e+03, percent-clipped=21.0 2023-06-28 14:41:10,412 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=2078826.0, ans=0.125 2023-06-28 14:41:45,731 INFO [train.py:996] (3/4) Epoch 12, batch 11050, loss[loss=0.1959, simple_loss=0.2562, pruned_loss=0.06778, over 21388.00 frames. ], tot_loss[loss=0.2039, simple_loss=0.2808, pruned_loss=0.06345, over 4257069.06 frames. ], batch size: 177, lr: 2.43e-03, grad_scale: 16.0 2023-06-28 14:41:48,282 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=2078946.0, ans=0.125 2023-06-28 14:42:28,353 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=2079066.0, ans=0.125 2023-06-28 14:42:53,002 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-28 14:42:54,683 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=2079126.0, ans=0.09899494936611666 2023-06-28 14:43:20,186 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.78 vs. limit=15.0 2023-06-28 14:43:21,022 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=2079186.0, ans=0.07 2023-06-28 14:43:23,999 INFO [train.py:996] (3/4) Epoch 12, batch 11100, loss[loss=0.1906, simple_loss=0.2595, pruned_loss=0.06089, over 21781.00 frames. ], tot_loss[loss=0.2019, simple_loss=0.2776, pruned_loss=0.06305, over 4257164.02 frames. ], batch size: 371, lr: 2.43e-03, grad_scale: 16.0 2023-06-28 14:44:14,561 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=2079366.0, ans=0.0 2023-06-28 14:44:22,372 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.287e+02 7.124e+02 1.046e+03 1.474e+03 3.228e+03, threshold=2.092e+03, percent-clipped=3.0 2023-06-28 14:44:53,912 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=2079486.0, ans=0.07 2023-06-28 14:45:06,626 INFO [train.py:996] (3/4) Epoch 12, batch 11150, loss[loss=0.1818, simple_loss=0.2548, pruned_loss=0.05442, over 21842.00 frames. ], tot_loss[loss=0.2002, simple_loss=0.2748, pruned_loss=0.06283, over 4245467.55 frames. ], batch size: 318, lr: 2.43e-03, grad_scale: 16.0 2023-06-28 14:45:23,983 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=2079546.0, ans=0.125 2023-06-28 14:46:04,311 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.42 vs. limit=15.0 2023-06-28 14:46:12,900 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.56 vs. 
limit=15.0 2023-06-28 14:46:49,367 INFO [train.py:996] (3/4) Epoch 12, batch 11200, loss[loss=0.1806, simple_loss=0.2555, pruned_loss=0.05282, over 21532.00 frames. ], tot_loss[loss=0.1992, simple_loss=0.2735, pruned_loss=0.06248, over 4251738.95 frames. ], batch size: 263, lr: 2.43e-03, grad_scale: 32.0 2023-06-28 14:47:29,119 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=2079906.0, ans=0.025 2023-06-28 14:47:41,178 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=8.22 vs. limit=15.0 2023-06-28 14:47:48,165 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.550e+02 9.831e+02 1.329e+03 1.720e+03 5.358e+03, threshold=2.658e+03, percent-clipped=16.0 2023-06-28 14:48:30,193 INFO [train.py:996] (3/4) Epoch 12, batch 11250, loss[loss=0.1929, simple_loss=0.2787, pruned_loss=0.05353, over 21871.00 frames. ], tot_loss[loss=0.1995, simple_loss=0.2725, pruned_loss=0.06328, over 4255854.53 frames. ], batch size: 316, lr: 2.43e-03, grad_scale: 16.0 2023-06-28 14:49:00,270 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=2080206.0, ans=0.0 2023-06-28 14:49:07,079 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2080206.0, ans=0.1 2023-06-28 14:49:31,937 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=2080326.0, ans=0.0 2023-06-28 14:49:40,054 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=2080326.0, ans=0.09899494936611666 2023-06-28 14:50:12,574 INFO [train.py:996] (3/4) Epoch 12, batch 11300, loss[loss=0.1864, simple_loss=0.261, pruned_loss=0.05593, over 21194.00 frames. ], tot_loss[loss=0.2006, simple_loss=0.2747, pruned_loss=0.06321, over 4255221.61 frames. ], batch size: 159, lr: 2.43e-03, grad_scale: 16.0 2023-06-28 14:50:20,029 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=2080446.0, ans=0.125 2023-06-28 14:50:22,038 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=2080446.0, ans=0.125 2023-06-28 14:51:14,610 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.978e+02 7.526e+02 1.048e+03 1.657e+03 3.488e+03, threshold=2.097e+03, percent-clipped=3.0 2023-06-28 14:51:16,916 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=2080626.0, ans=0.125 2023-06-28 14:51:36,331 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=2080626.0, ans=0.125 2023-06-28 14:51:41,433 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2080686.0, ans=0.1 2023-06-28 14:51:55,957 INFO [train.py:996] (3/4) Epoch 12, batch 11350, loss[loss=0.2351, simple_loss=0.3183, pruned_loss=0.0759, over 21774.00 frames. ], tot_loss[loss=0.2006, simple_loss=0.2762, pruned_loss=0.06249, over 4265761.02 frames. 
], batch size: 118, lr: 2.43e-03, grad_scale: 8.0 2023-06-28 14:52:14,950 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=2080746.0, ans=0.125 2023-06-28 14:53:51,202 INFO [train.py:996] (3/4) Epoch 12, batch 11400, loss[loss=0.2078, simple_loss=0.3041, pruned_loss=0.0557, over 21721.00 frames. ], tot_loss[loss=0.2049, simple_loss=0.2813, pruned_loss=0.06424, over 4270894.46 frames. ], batch size: 332, lr: 2.43e-03, grad_scale: 8.0 2023-06-28 14:54:06,713 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2081106.0, ans=0.1 2023-06-28 14:54:51,838 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.874e+02 7.904e+02 1.165e+03 1.837e+03 4.416e+03, threshold=2.330e+03, percent-clipped=18.0 2023-06-28 14:55:06,279 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=2081286.0, ans=0.0 2023-06-28 14:55:24,521 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=2081286.0, ans=0.0 2023-06-28 14:55:34,010 INFO [train.py:996] (3/4) Epoch 12, batch 11450, loss[loss=0.1866, simple_loss=0.2676, pruned_loss=0.05279, over 21468.00 frames. ], tot_loss[loss=0.2063, simple_loss=0.2845, pruned_loss=0.06404, over 4277461.58 frames. ], batch size: 211, lr: 2.43e-03, grad_scale: 8.0 2023-06-28 14:55:34,684 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=2081346.0, ans=0.0 2023-06-28 14:56:11,631 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-28 14:56:41,908 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=2081526.0, ans=0.0 2023-06-28 14:56:41,965 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=2081526.0, ans=0.0 2023-06-28 14:57:03,703 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2081586.0, ans=0.1 2023-06-28 14:57:08,931 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn2.whiten.whitening_limit, batch_count=2081586.0, ans=22.5 2023-06-28 14:57:18,106 INFO [train.py:996] (3/4) Epoch 12, batch 11500, loss[loss=0.1914, simple_loss=0.2908, pruned_loss=0.04594, over 21865.00 frames. ], tot_loss[loss=0.2091, simple_loss=0.2876, pruned_loss=0.06529, over 4273821.26 frames. 
], batch size: 316, lr: 2.43e-03, grad_scale: 8.0 2023-06-28 14:57:21,992 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2081646.0, ans=0.1 2023-06-28 14:58:20,247 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.484e+02 9.519e+02 1.302e+03 1.957e+03 4.452e+03, threshold=2.605e+03, percent-clipped=16.0 2023-06-28 14:58:42,619 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=2081826.0, ans=0.125 2023-06-28 14:58:44,522 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=2081886.0, ans=0.125 2023-06-28 14:58:46,186 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-28 14:58:56,799 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2081886.0, ans=0.1 2023-06-28 14:59:03,072 INFO [train.py:996] (3/4) Epoch 12, batch 11550, loss[loss=0.2659, simple_loss=0.3737, pruned_loss=0.07911, over 21658.00 frames. ], tot_loss[loss=0.2135, simple_loss=0.2944, pruned_loss=0.06628, over 4277618.83 frames. ], batch size: 414, lr: 2.43e-03, grad_scale: 8.0 2023-06-28 14:59:17,681 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=2081946.0, ans=0.125 2023-06-28 14:59:26,282 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.83 vs. limit=22.5 2023-06-28 14:59:46,277 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=2082066.0, ans=0.125 2023-06-28 15:00:00,235 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=9.28 vs. limit=15.0 2023-06-28 15:00:39,518 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2082186.0, ans=0.1 2023-06-28 15:00:46,578 INFO [train.py:996] (3/4) Epoch 12, batch 11600, loss[loss=0.247, simple_loss=0.332, pruned_loss=0.081, over 21307.00 frames. ], tot_loss[loss=0.2222, simple_loss=0.3076, pruned_loss=0.06836, over 4271351.50 frames. ], batch size: 159, lr: 2.43e-03, grad_scale: 16.0 2023-06-28 15:01:03,527 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=2082246.0, ans=0.0 2023-06-28 15:01:12,163 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2082306.0, ans=0.1 2023-06-28 15:01:50,474 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2082366.0, ans=0.1 2023-06-28 15:01:58,182 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.568e+02 8.664e+02 1.450e+03 2.268e+03 5.007e+03, threshold=2.901e+03, percent-clipped=18.0 2023-06-28 15:02:10,537 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=2082426.0, ans=0.125 2023-06-28 15:02:35,268 INFO [train.py:996] (3/4) Epoch 12, batch 11650, loss[loss=0.2122, simple_loss=0.2967, pruned_loss=0.06382, over 21646.00 frames. 
], tot_loss[loss=0.2269, simple_loss=0.3151, pruned_loss=0.06935, over 4277894.92 frames. ], batch size: 263, lr: 2.43e-03, grad_scale: 16.0 2023-06-28 15:02:53,701 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=2082546.0, ans=0.035 2023-06-28 15:03:07,094 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=2082606.0, ans=0.2 2023-06-28 15:04:04,407 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=2082786.0, ans=0.125 2023-06-28 15:04:06,021 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=2082786.0, ans=0.0 2023-06-28 15:04:16,901 INFO [train.py:996] (3/4) Epoch 12, batch 11700, loss[loss=0.2123, simple_loss=0.2846, pruned_loss=0.07003, over 21750.00 frames. ], tot_loss[loss=0.2225, simple_loss=0.3089, pruned_loss=0.06805, over 4276634.37 frames. ], batch size: 112, lr: 2.43e-03, grad_scale: 16.0 2023-06-28 15:04:35,244 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=2082846.0, ans=0.0 2023-06-28 15:05:22,741 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.474e+02 9.550e+02 1.552e+03 2.202e+03 4.902e+03, threshold=3.105e+03, percent-clipped=9.0 2023-06-28 15:06:04,483 INFO [train.py:996] (3/4) Epoch 12, batch 11750, loss[loss=0.2168, simple_loss=0.2937, pruned_loss=0.06994, over 21494.00 frames. ], tot_loss[loss=0.2165, simple_loss=0.2985, pruned_loss=0.06729, over 4267465.92 frames. ], batch size: 389, lr: 2.43e-03, grad_scale: 16.0 2023-06-28 15:07:00,834 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.23 vs. limit=10.0 2023-06-28 15:07:16,101 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.18 vs. limit=6.0 2023-06-28 15:07:42,706 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=12.59 vs. limit=15.0 2023-06-28 15:07:47,887 INFO [train.py:996] (3/4) Epoch 12, batch 11800, loss[loss=0.1978, simple_loss=0.2932, pruned_loss=0.05121, over 21622.00 frames. ], tot_loss[loss=0.2182, simple_loss=0.2993, pruned_loss=0.06853, over 4269518.71 frames. ], batch size: 230, lr: 2.43e-03, grad_scale: 16.0 2023-06-28 15:07:50,231 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=2083446.0, ans=0.0 2023-06-28 15:08:05,281 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=2083506.0, ans=0.125 2023-06-28 15:08:08,591 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=2083506.0, ans=0.0 2023-06-28 15:08:16,492 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=2083506.0, ans=0.125 2023-06-28 15:08:31,878 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.29 vs. 
limit=15.0 2023-06-28 15:08:48,880 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.444e+02 9.330e+02 1.436e+03 2.225e+03 5.022e+03, threshold=2.872e+03, percent-clipped=11.0 2023-06-28 15:08:51,403 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=2083626.0, ans=0.0 2023-06-28 15:09:26,696 INFO [train.py:996] (3/4) Epoch 12, batch 11850, loss[loss=0.1914, simple_loss=0.2749, pruned_loss=0.05395, over 21840.00 frames. ], tot_loss[loss=0.2172, simple_loss=0.2987, pruned_loss=0.06787, over 4279547.29 frames. ], batch size: 118, lr: 2.43e-03, grad_scale: 16.0 2023-06-28 15:10:23,413 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.31 vs. limit=15.0 2023-06-28 15:10:56,151 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=2083986.0, ans=0.2 2023-06-28 15:11:07,496 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=2083986.0, ans=0.0 2023-06-28 15:11:10,424 INFO [train.py:996] (3/4) Epoch 12, batch 11900, loss[loss=0.206, simple_loss=0.2941, pruned_loss=0.05898, over 21545.00 frames. ], tot_loss[loss=0.2159, simple_loss=0.2994, pruned_loss=0.06625, over 4275858.77 frames. ], batch size: 389, lr: 2.43e-03, grad_scale: 16.0 2023-06-28 15:11:20,096 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=2084046.0, ans=0.125 2023-06-28 15:11:20,138 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=2084046.0, ans=0.125 2023-06-28 15:12:07,400 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=2084166.0, ans=0.125 2023-06-28 15:12:13,558 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.185e+02 7.198e+02 9.065e+02 1.390e+03 3.282e+03, threshold=1.813e+03, percent-clipped=3.0 2023-06-28 15:12:29,337 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=2084226.0, ans=0.0 2023-06-28 15:12:35,924 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=3.77 vs. limit=15.0 2023-06-28 15:12:54,375 INFO [train.py:996] (3/4) Epoch 12, batch 11950, loss[loss=0.1916, simple_loss=0.2868, pruned_loss=0.04817, over 21806.00 frames. ], tot_loss[loss=0.2144, simple_loss=0.3004, pruned_loss=0.06421, over 4273785.40 frames. 
], batch size: 371, lr: 2.43e-03, grad_scale: 16.0 2023-06-28 15:13:14,032 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=2084346.0, ans=0.125 2023-06-28 15:13:17,211 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=2084346.0, ans=0.0 2023-06-28 15:13:18,882 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=2084406.0, ans=0.0 2023-06-28 15:13:42,591 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2084466.0, ans=0.1 2023-06-28 15:14:23,061 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=2084586.0, ans=0.125 2023-06-28 15:14:29,753 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=2084586.0, ans=0.0 2023-06-28 15:14:29,840 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=2084586.0, ans=0.07 2023-06-28 15:14:35,781 INFO [train.py:996] (3/4) Epoch 12, batch 12000, loss[loss=0.1926, simple_loss=0.254, pruned_loss=0.06561, over 21240.00 frames. ], tot_loss[loss=0.2089, simple_loss=0.2937, pruned_loss=0.06205, over 4271068.27 frames. ], batch size: 176, lr: 2.43e-03, grad_scale: 32.0 2023-06-28 15:14:35,782 INFO [train.py:1019] (3/4) Computing validation loss 2023-06-28 15:14:56,362 INFO [train.py:1028] (3/4) Epoch 12, validation: loss=0.2655, simple_loss=0.3539, pruned_loss=0.08861, over 1796401.00 frames. 2023-06-28 15:14:56,363 INFO [train.py:1029] (3/4) Maximum memory allocated so far is 23690MB 2023-06-28 15:15:25,657 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=2084706.0, ans=0.0 2023-06-28 15:15:33,730 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-28 15:15:34,574 INFO [scaling.py:962] (3/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=3.75 vs. limit=5.0 2023-06-28 15:15:57,094 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.607e+02 7.542e+02 1.127e+03 1.617e+03 2.900e+03, threshold=2.254e+03, percent-clipped=14.0 2023-06-28 15:16:08,361 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=2084826.0, ans=0.125 2023-06-28 15:16:15,806 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.48 vs. limit=22.5 2023-06-28 15:16:15,858 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.12 vs. limit=15.0 2023-06-28 15:16:37,931 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=2084946.0, ans=0.0 2023-06-28 15:16:38,992 INFO [train.py:996] (3/4) Epoch 12, batch 12050, loss[loss=0.1912, simple_loss=0.2633, pruned_loss=0.05953, over 21884.00 frames. ], tot_loss[loss=0.2079, simple_loss=0.2895, pruned_loss=0.06317, over 4283078.62 frames. 
], batch size: 283, lr: 2.43e-03, grad_scale: 32.0 2023-06-28 15:16:59,645 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=2085006.0, ans=0.0 2023-06-28 15:17:23,506 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2085066.0, ans=0.1 2023-06-28 15:17:31,479 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2085066.0, ans=0.1 2023-06-28 15:17:52,252 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=2085126.0, ans=0.125 2023-06-28 15:18:22,718 INFO [train.py:996] (3/4) Epoch 12, batch 12100, loss[loss=0.2391, simple_loss=0.3149, pruned_loss=0.08171, over 21757.00 frames. ], tot_loss[loss=0.2116, simple_loss=0.292, pruned_loss=0.06559, over 4286246.02 frames. ], batch size: 332, lr: 2.43e-03, grad_scale: 16.0 2023-06-28 15:18:59,343 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=2085306.0, ans=0.0 2023-06-28 15:19:28,427 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.830e+02 8.334e+02 1.059e+03 1.614e+03 4.516e+03, threshold=2.118e+03, percent-clipped=9.0 2023-06-28 15:20:08,868 INFO [train.py:996] (3/4) Epoch 12, batch 12150, loss[loss=0.2185, simple_loss=0.3159, pruned_loss=0.06051, over 19706.00 frames. ], tot_loss[loss=0.2139, simple_loss=0.2955, pruned_loss=0.06615, over 4282299.59 frames. ], batch size: 704, lr: 2.43e-03, grad_scale: 16.0 2023-06-28 15:21:31,258 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-28 15:21:39,484 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=2085786.0, ans=0.0 2023-06-28 15:21:50,493 INFO [train.py:996] (3/4) Epoch 12, batch 12200, loss[loss=0.191, simple_loss=0.2529, pruned_loss=0.06449, over 21537.00 frames. ], tot_loss[loss=0.211, simple_loss=0.2929, pruned_loss=0.06453, over 4280464.19 frames. ], batch size: 230, lr: 2.43e-03, grad_scale: 16.0 2023-06-28 15:21:59,576 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=2085846.0, ans=0.125 2023-06-28 15:22:09,611 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=2085846.0, ans=0.125 2023-06-28 15:22:16,973 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=4.77 vs. limit=15.0 2023-06-28 15:22:30,264 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.97 vs. 
limit=6.0 2023-06-28 15:22:34,566 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=2085966.0, ans=0.0 2023-06-28 15:22:36,318 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=2085966.0, ans=0.0 2023-06-28 15:22:53,106 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=2085966.0, ans=0.95 2023-06-28 15:23:03,724 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.152e+02 7.335e+02 1.257e+03 1.849e+03 4.350e+03, threshold=2.514e+03, percent-clipped=17.0 2023-06-28 15:23:17,578 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=2086086.0, ans=0.0 2023-06-28 15:23:33,608 INFO [train.py:996] (3/4) Epoch 12, batch 12250, loss[loss=0.1249, simple_loss=0.194, pruned_loss=0.02796, over 21768.00 frames. ], tot_loss[loss=0.2057, simple_loss=0.2862, pruned_loss=0.06259, over 4267224.19 frames. ], batch size: 107, lr: 2.43e-03, grad_scale: 16.0 2023-06-28 15:24:19,366 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=2086266.0, ans=0.125 2023-06-28 15:24:22,888 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=2086266.0, ans=0.2 2023-06-28 15:24:34,128 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=2086266.0, ans=0.0 2023-06-28 15:25:04,077 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=2086386.0, ans=0.125 2023-06-28 15:25:09,094 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=2086386.0, ans=0.125 2023-06-28 15:25:16,905 INFO [train.py:996] (3/4) Epoch 12, batch 12300, loss[loss=0.2398, simple_loss=0.3372, pruned_loss=0.07124, over 21711.00 frames. ], tot_loss[loss=0.1974, simple_loss=0.2793, pruned_loss=0.05775, over 4268507.07 frames. ], batch size: 441, lr: 2.43e-03, grad_scale: 16.0 2023-06-28 15:26:29,033 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.595e+02 6.953e+02 1.064e+03 1.818e+03 4.648e+03, threshold=2.128e+03, percent-clipped=12.0 2023-06-28 15:26:59,115 INFO [train.py:996] (3/4) Epoch 12, batch 12350, loss[loss=0.2118, simple_loss=0.2968, pruned_loss=0.0634, over 21482.00 frames. ], tot_loss[loss=0.2006, simple_loss=0.2843, pruned_loss=0.05844, over 4265536.33 frames. ], batch size: 131, lr: 2.43e-03, grad_scale: 16.0 2023-06-28 15:27:27,266 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=2086806.0, ans=0.125 2023-06-28 15:27:56,218 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=2086866.0, ans=0.125 2023-06-28 15:28:28,180 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.21 vs. limit=15.0 2023-06-28 15:28:40,053 INFO [train.py:996] (3/4) Epoch 12, batch 12400, loss[loss=0.2163, simple_loss=0.2793, pruned_loss=0.07661, over 21606.00 frames. 
], tot_loss[loss=0.205, simple_loss=0.287, pruned_loss=0.06148, over 4269370.42 frames. ], batch size: 230, lr: 2.43e-03, grad_scale: 32.0 2023-06-28 15:29:54,414 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.210e+02 8.096e+02 1.137e+03 1.573e+03 3.341e+03, threshold=2.274e+03, percent-clipped=11.0 2023-06-28 15:30:00,310 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=2087226.0, ans=0.125 2023-06-28 15:30:05,275 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=2087286.0, ans=0.125 2023-06-28 15:30:13,702 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=2087286.0, ans=0.0 2023-06-28 15:30:18,583 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=2087286.0, ans=0.2 2023-06-28 15:30:20,830 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.03 vs. limit=15.0 2023-06-28 15:30:32,675 INFO [train.py:996] (3/4) Epoch 12, batch 12450, loss[loss=0.2547, simple_loss=0.3278, pruned_loss=0.0908, over 21319.00 frames. ], tot_loss[loss=0.2083, simple_loss=0.2896, pruned_loss=0.06352, over 4275836.22 frames. ], batch size: 159, lr: 2.43e-03, grad_scale: 16.0 2023-06-28 15:30:48,826 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2087406.0, ans=0.1 2023-06-28 15:30:59,033 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=2087406.0, ans=0.0 2023-06-28 15:31:18,153 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=2087466.0, ans=0.125 2023-06-28 15:31:28,394 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=2087466.0, ans=0.5 2023-06-28 15:31:36,794 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=2087526.0, ans=0.125 2023-06-28 15:31:38,647 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=2087526.0, ans=0.125 2023-06-28 15:32:11,616 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=2087586.0, ans=0.125 2023-06-28 15:32:16,003 INFO [train.py:996] (3/4) Epoch 12, batch 12500, loss[loss=0.3115, simple_loss=0.4072, pruned_loss=0.1079, over 21405.00 frames. ], tot_loss[loss=0.2172, simple_loss=0.3003, pruned_loss=0.06709, over 4274609.09 frames. ], batch size: 507, lr: 2.43e-03, grad_scale: 16.0 2023-06-28 15:32:18,252 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=2087646.0, ans=0.125 2023-06-28 15:32:25,469 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=2087646.0, ans=0.125 2023-06-28 15:32:29,464 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.92 vs. 
limit=10.0 2023-06-28 15:32:53,343 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=2087706.0, ans=0.0 2023-06-28 15:33:22,111 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.609e+02 8.449e+02 1.202e+03 1.905e+03 3.240e+03, threshold=2.404e+03, percent-clipped=12.0 2023-06-28 15:34:05,878 INFO [train.py:996] (3/4) Epoch 12, batch 12550, loss[loss=0.2441, simple_loss=0.3141, pruned_loss=0.08709, over 21393.00 frames. ], tot_loss[loss=0.2229, simple_loss=0.3062, pruned_loss=0.06979, over 4275183.39 frames. ], batch size: 549, lr: 2.43e-03, grad_scale: 16.0 2023-06-28 15:34:08,291 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=2087946.0, ans=0.0 2023-06-28 15:35:38,820 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.02 vs. limit=10.0 2023-06-28 15:35:50,638 INFO [train.py:996] (3/4) Epoch 12, batch 12600, loss[loss=0.155, simple_loss=0.2261, pruned_loss=0.04196, over 16811.00 frames. ], tot_loss[loss=0.2201, simple_loss=0.3052, pruned_loss=0.06748, over 4271424.63 frames. ], batch size: 63, lr: 2.43e-03, grad_scale: 16.0 2023-06-28 15:36:19,110 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=2088306.0, ans=0.0 2023-06-28 15:36:59,613 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.816e+02 8.006e+02 1.115e+03 1.640e+03 2.498e+03, threshold=2.229e+03, percent-clipped=4.0 2023-06-28 15:37:09,782 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=2088486.0, ans=0.0 2023-06-28 15:37:31,411 INFO [train.py:996] (3/4) Epoch 12, batch 12650, loss[loss=0.1867, simple_loss=0.2623, pruned_loss=0.05562, over 21786.00 frames. ], tot_loss[loss=0.2128, simple_loss=0.2973, pruned_loss=0.06417, over 4272721.94 frames. ], batch size: 247, lr: 2.43e-03, grad_scale: 16.0 2023-06-28 15:37:43,073 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=2088546.0, ans=0.0 2023-06-28 15:37:54,528 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2088606.0, ans=0.1 2023-06-28 15:37:56,124 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-28 15:38:24,323 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=2088666.0, ans=0.125 2023-06-28 15:39:01,244 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=2088786.0, ans=0.125 2023-06-28 15:39:18,831 INFO [train.py:996] (3/4) Epoch 12, batch 12700, loss[loss=0.2148, simple_loss=0.2936, pruned_loss=0.06804, over 21770.00 frames. ], tot_loss[loss=0.214, simple_loss=0.2957, pruned_loss=0.06612, over 4282060.62 frames. 
], batch size: 332, lr: 2.43e-03, grad_scale: 16.0 2023-06-28 15:39:31,459 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=2088846.0, ans=0.0 2023-06-28 15:39:55,967 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=2088966.0, ans=0.125 2023-06-28 15:40:22,985 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.191e+02 7.639e+02 1.038e+03 1.743e+03 3.264e+03, threshold=2.075e+03, percent-clipped=12.0 2023-06-28 15:40:38,497 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=2089086.0, ans=0.125 2023-06-28 15:41:01,160 INFO [train.py:996] (3/4) Epoch 12, batch 12750, loss[loss=0.2006, simple_loss=0.2905, pruned_loss=0.05535, over 21806.00 frames. ], tot_loss[loss=0.2148, simple_loss=0.2965, pruned_loss=0.06652, over 4276442.90 frames. ], batch size: 351, lr: 2.43e-03, grad_scale: 16.0 2023-06-28 15:41:23,318 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-28 15:41:27,085 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=6.43 vs. limit=15.0 2023-06-28 15:42:40,321 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=2089386.0, ans=0.0 2023-06-28 15:42:43,109 INFO [train.py:996] (3/4) Epoch 12, batch 12800, loss[loss=0.2005, simple_loss=0.2734, pruned_loss=0.06376, over 21601.00 frames. ], tot_loss[loss=0.2138, simple_loss=0.2939, pruned_loss=0.06685, over 4275038.35 frames. ], batch size: 548, lr: 2.43e-03, grad_scale: 32.0 2023-06-28 15:42:45,681 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=2089446.0, ans=0.125 2023-06-28 15:43:23,554 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=2089506.0, ans=0.125 2023-06-28 15:43:55,075 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.524e+02 8.190e+02 1.176e+03 1.690e+03 3.535e+03, threshold=2.352e+03, percent-clipped=8.0 2023-06-28 15:44:09,094 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2089686.0, ans=0.1 2023-06-28 15:44:22,814 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=2089686.0, ans=0.2 2023-06-28 15:44:26,280 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=2089746.0, ans=0.0 2023-06-28 15:44:27,201 INFO [train.py:996] (3/4) Epoch 12, batch 12850, loss[loss=0.2193, simple_loss=0.3158, pruned_loss=0.06134, over 19910.00 frames. ], tot_loss[loss=0.2156, simple_loss=0.2949, pruned_loss=0.06813, over 4273570.26 frames. 
], batch size: 702, lr: 2.43e-03, grad_scale: 16.0 2023-06-28 15:44:44,761 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2089746.0, ans=0.1 2023-06-28 15:45:29,094 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=2089866.0, ans=0.125 2023-06-28 15:45:52,666 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=2089986.0, ans=0.0 2023-06-28 15:46:15,554 INFO [train.py:996] (3/4) Epoch 12, batch 12900, loss[loss=0.197, simple_loss=0.2914, pruned_loss=0.05132, over 20820.00 frames. ], tot_loss[loss=0.2127, simple_loss=0.2933, pruned_loss=0.06609, over 4273142.82 frames. ], batch size: 608, lr: 2.43e-03, grad_scale: 16.0 2023-06-28 15:46:17,818 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=2090046.0, ans=0.0 2023-06-28 15:46:43,155 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=2.95 vs. limit=15.0 2023-06-28 15:46:50,456 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=2090106.0, ans=0.125 2023-06-28 15:47:13,031 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=2090166.0, ans=0.1 2023-06-28 15:47:25,954 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.842e+02 7.681e+02 1.209e+03 1.743e+03 3.932e+03, threshold=2.418e+03, percent-clipped=11.0 2023-06-28 15:48:02,340 INFO [train.py:996] (3/4) Epoch 12, batch 12950, loss[loss=0.2818, simple_loss=0.3491, pruned_loss=0.1073, over 21433.00 frames. ], tot_loss[loss=0.2111, simple_loss=0.2924, pruned_loss=0.06493, over 4275193.33 frames. ], batch size: 509, lr: 2.43e-03, grad_scale: 16.0 2023-06-28 15:48:18,598 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=6.93 vs. limit=15.0 2023-06-28 15:49:11,211 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.97 vs. limit=15.0 2023-06-28 15:49:22,394 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.48 vs. limit=10.0 2023-06-28 15:49:28,908 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=2090586.0, ans=0.0 2023-06-28 15:49:42,001 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=2090586.0, ans=0.125 2023-06-28 15:49:44,546 INFO [train.py:996] (3/4) Epoch 12, batch 13000, loss[loss=0.1704, simple_loss=0.2597, pruned_loss=0.04051, over 21691.00 frames. ], tot_loss[loss=0.212, simple_loss=0.2938, pruned_loss=0.06514, over 4275491.56 frames. 
], batch size: 247, lr: 2.43e-03, grad_scale: 16.0 2023-06-28 15:49:47,159 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=2090646.0, ans=0.0 2023-06-28 15:49:50,502 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=2090646.0, ans=0.0 2023-06-28 15:50:50,318 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.116e+02 7.680e+02 1.047e+03 1.374e+03 2.853e+03, threshold=2.094e+03, percent-clipped=2.0 2023-06-28 15:51:02,584 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.83 vs. limit=22.5 2023-06-28 15:51:25,517 INFO [train.py:996] (3/4) Epoch 12, batch 13050, loss[loss=0.2039, simple_loss=0.278, pruned_loss=0.06488, over 21900.00 frames. ], tot_loss[loss=0.2083, simple_loss=0.2896, pruned_loss=0.06357, over 4274800.90 frames. ], batch size: 316, lr: 2.43e-03, grad_scale: 16.0 2023-06-28 15:51:42,784 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-28 15:52:23,279 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=2091126.0, ans=0.125 2023-06-28 15:52:47,585 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=2091186.0, ans=0.0 2023-06-28 15:53:07,012 INFO [train.py:996] (3/4) Epoch 12, batch 13100, loss[loss=0.2097, simple_loss=0.2924, pruned_loss=0.06355, over 21320.00 frames. ], tot_loss[loss=0.2083, simple_loss=0.2903, pruned_loss=0.06311, over 4283006.48 frames. ], batch size: 176, lr: 2.43e-03, grad_scale: 16.0 2023-06-28 15:53:12,571 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=2091246.0, ans=0.125 2023-06-28 15:53:39,574 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=2091306.0, ans=0.125 2023-06-28 15:53:58,517 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=2091366.0, ans=0.0 2023-06-28 15:54:14,291 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.157e+02 6.908e+02 8.185e+02 1.188e+03 2.631e+03, threshold=1.637e+03, percent-clipped=2.0 2023-06-28 15:54:50,931 INFO [train.py:996] (3/4) Epoch 12, batch 13150, loss[loss=0.2254, simple_loss=0.2948, pruned_loss=0.07797, over 21753.00 frames. ], tot_loss[loss=0.2115, simple_loss=0.2926, pruned_loss=0.06523, over 4285457.92 frames. 
], batch size: 351, lr: 2.43e-03, grad_scale: 16.0 2023-06-28 15:54:59,255 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=2091546.0, ans=0.125 2023-06-28 15:55:12,576 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=2091606.0, ans=0.0 2023-06-28 15:55:36,067 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=2091666.0, ans=0.125 2023-06-28 15:55:36,145 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=2091666.0, ans=0.0 2023-06-28 15:55:59,317 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2091726.0, ans=0.1 2023-06-28 15:56:16,274 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.42 vs. limit=15.0 2023-06-28 15:56:37,584 INFO [train.py:996] (3/4) Epoch 12, batch 13200, loss[loss=0.2302, simple_loss=0.3028, pruned_loss=0.07882, over 21454.00 frames. ], tot_loss[loss=0.2097, simple_loss=0.2903, pruned_loss=0.06452, over 4273325.96 frames. ], batch size: 211, lr: 2.43e-03, grad_scale: 32.0 2023-06-28 15:56:38,222 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=2091846.0, ans=0.125 2023-06-28 15:57:18,701 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=2091966.0, ans=0.125 2023-06-28 15:57:46,670 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.053e+02 7.241e+02 1.163e+03 1.743e+03 3.163e+03, threshold=2.326e+03, percent-clipped=27.0 2023-06-28 15:58:21,568 INFO [train.py:996] (3/4) Epoch 12, batch 13250, loss[loss=0.2008, simple_loss=0.2806, pruned_loss=0.06049, over 21280.00 frames. ], tot_loss[loss=0.2123, simple_loss=0.2917, pruned_loss=0.06641, over 4274872.40 frames. ], batch size: 176, lr: 2.43e-03, grad_scale: 16.0 2023-06-28 15:58:40,859 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=2092206.0, ans=0.0 2023-06-28 15:58:59,177 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=2092266.0, ans=0.125 2023-06-28 15:59:22,325 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=2092326.0, ans=0.0 2023-06-28 15:59:24,202 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=2092326.0, ans=0.125 2023-06-28 16:00:05,145 INFO [train.py:996] (3/4) Epoch 12, batch 13300, loss[loss=0.2526, simple_loss=0.3281, pruned_loss=0.08856, over 21414.00 frames. ], tot_loss[loss=0.2145, simple_loss=0.297, pruned_loss=0.06603, over 4274512.30 frames. 
], batch size: 131, lr: 2.43e-03, grad_scale: 16.0 2023-06-28 16:00:05,815 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=2092446.0, ans=0.125 2023-06-28 16:00:19,051 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=2092446.0, ans=0.035 2023-06-28 16:01:23,226 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.410e+02 8.803e+02 1.227e+03 2.112e+03 5.928e+03, threshold=2.454e+03, percent-clipped=20.0 2023-06-28 16:01:25,546 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=2092626.0, ans=0.0 2023-06-28 16:01:27,679 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=2092626.0, ans=0.0 2023-06-28 16:01:36,230 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.62 vs. limit=22.5 2023-06-28 16:01:39,158 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=2092686.0, ans=0.125 2023-06-28 16:01:48,701 INFO [train.py:996] (3/4) Epoch 12, batch 13350, loss[loss=0.2732, simple_loss=0.3487, pruned_loss=0.09883, over 21450.00 frames. ], tot_loss[loss=0.2196, simple_loss=0.3019, pruned_loss=0.06863, over 4275275.03 frames. ], batch size: 471, lr: 2.43e-03, grad_scale: 16.0 2023-06-28 16:01:57,177 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2092746.0, ans=0.1 2023-06-28 16:02:23,575 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.57 vs. limit=22.5 2023-06-28 16:02:34,992 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2092866.0, ans=0.1 2023-06-28 16:03:10,847 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=2092926.0, ans=0.125 2023-06-28 16:03:27,471 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2092986.0, ans=0.1 2023-06-28 16:03:29,046 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=2092986.0, ans=0.0 2023-06-28 16:03:35,153 INFO [train.py:996] (3/4) Epoch 12, batch 13400, loss[loss=0.2148, simple_loss=0.2929, pruned_loss=0.06833, over 21833.00 frames. ], tot_loss[loss=0.2213, simple_loss=0.302, pruned_loss=0.07026, over 4273618.91 frames. ], batch size: 298, lr: 2.43e-03, grad_scale: 16.0 2023-06-28 16:03:54,283 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=2093046.0, ans=0.125 2023-06-28 16:04:33,469 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=12.51 vs. limit=15.0 2023-06-28 16:04:38,721 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=7.47 vs. 
limit=15.0 2023-06-28 16:04:44,339 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.284e+02 9.193e+02 1.348e+03 2.044e+03 4.158e+03, threshold=2.695e+03, percent-clipped=16.0 2023-06-28 16:05:14,145 INFO [train.py:996] (3/4) Epoch 12, batch 13450, loss[loss=0.1951, simple_loss=0.27, pruned_loss=0.06014, over 21410.00 frames. ], tot_loss[loss=0.2229, simple_loss=0.3026, pruned_loss=0.07157, over 4268922.69 frames. ], batch size: 131, lr: 2.42e-03, grad_scale: 16.0 2023-06-28 16:05:25,390 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=9.07 vs. limit=12.0 2023-06-28 16:06:15,345 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=8.56 vs. limit=15.0 2023-06-28 16:06:24,924 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=2093526.0, ans=0.0 2023-06-28 16:06:58,131 INFO [train.py:996] (3/4) Epoch 12, batch 13500, loss[loss=0.2315, simple_loss=0.3143, pruned_loss=0.07431, over 21349.00 frames. ], tot_loss[loss=0.2154, simple_loss=0.2935, pruned_loss=0.06863, over 4264653.81 frames. ], batch size: 549, lr: 2.42e-03, grad_scale: 16.0 2023-06-28 16:08:13,367 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.877e+02 6.964e+02 1.038e+03 1.541e+03 3.052e+03, threshold=2.076e+03, percent-clipped=2.0 2023-06-28 16:08:40,780 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2093886.0, ans=0.1 2023-06-28 16:08:43,471 INFO [train.py:996] (3/4) Epoch 12, batch 13550, loss[loss=0.2447, simple_loss=0.3461, pruned_loss=0.07168, over 21783.00 frames. ], tot_loss[loss=0.2166, simple_loss=0.2974, pruned_loss=0.06784, over 4268014.25 frames. ], batch size: 351, lr: 2.42e-03, grad_scale: 16.0 2023-06-28 16:09:17,729 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.35 vs. limit=15.0 2023-06-28 16:09:27,327 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.64 vs. limit=15.0 2023-06-28 16:10:10,446 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=2094186.0, ans=0.09899494936611666 2023-06-28 16:10:26,345 INFO [train.py:996] (3/4) Epoch 12, batch 13600, loss[loss=0.2065, simple_loss=0.283, pruned_loss=0.06497, over 21363.00 frames. ], tot_loss[loss=0.2168, simple_loss=0.2986, pruned_loss=0.06753, over 4275838.37 frames. ], batch size: 144, lr: 2.42e-03, grad_scale: 32.0 2023-06-28 16:10:57,682 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=2094306.0, ans=0.015 2023-06-28 16:11:06,596 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.65 vs. limit=12.0 2023-06-28 16:11:14,474 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=2094366.0, ans=0.125 2023-06-28 16:11:33,493 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.25 vs. 
limit=15.0 2023-06-28 16:11:39,038 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.839e+02 7.727e+02 1.210e+03 1.733e+03 4.112e+03, threshold=2.419e+03, percent-clipped=15.0 2023-06-28 16:11:52,632 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=2094486.0, ans=0.0 2023-06-28 16:12:13,564 INFO [train.py:996] (3/4) Epoch 12, batch 13650, loss[loss=0.227, simple_loss=0.2768, pruned_loss=0.08858, over 21398.00 frames. ], tot_loss[loss=0.2112, simple_loss=0.2932, pruned_loss=0.06467, over 4278228.77 frames. ], batch size: 508, lr: 2.42e-03, grad_scale: 32.0 2023-06-28 16:12:25,517 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=2094546.0, ans=0.0 2023-06-28 16:12:34,581 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=2094606.0, ans=0.125 2023-06-28 16:12:37,534 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=2094606.0, ans=0.125 2023-06-28 16:13:57,088 INFO [train.py:996] (3/4) Epoch 12, batch 13700, loss[loss=0.1839, simple_loss=0.2418, pruned_loss=0.06297, over 21215.00 frames. ], tot_loss[loss=0.2081, simple_loss=0.2874, pruned_loss=0.06436, over 4271691.63 frames. ], batch size: 159, lr: 2.42e-03, grad_scale: 8.0 2023-06-28 16:15:15,684 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.251e+02 7.877e+02 1.121e+03 1.931e+03 5.975e+03, threshold=2.242e+03, percent-clipped=12.0 2023-06-28 16:15:16,978 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=2.55 vs. limit=15.0 2023-06-28 16:15:28,475 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.73 vs. limit=15.0 2023-06-28 16:15:41,906 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=13.52 vs. limit=22.5 2023-06-28 16:15:47,589 INFO [train.py:996] (3/4) Epoch 12, batch 13750, loss[loss=0.1624, simple_loss=0.221, pruned_loss=0.05186, over 21788.00 frames. ], tot_loss[loss=0.2065, simple_loss=0.2847, pruned_loss=0.06417, over 4271132.11 frames. ], batch size: 118, lr: 2.42e-03, grad_scale: 8.0 2023-06-28 16:15:48,255 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=2095146.0, ans=0.125 2023-06-28 16:15:51,895 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=2095146.0, ans=0.0 2023-06-28 16:16:36,640 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=2095266.0, ans=0.0 2023-06-28 16:17:34,225 INFO [train.py:996] (3/4) Epoch 12, batch 13800, loss[loss=0.2614, simple_loss=0.377, pruned_loss=0.07291, over 21245.00 frames. ], tot_loss[loss=0.2092, simple_loss=0.2911, pruned_loss=0.06366, over 4263697.45 frames. 
], batch size: 549, lr: 2.42e-03, grad_scale: 8.0 2023-06-28 16:17:38,640 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=2095446.0, ans=10.0 2023-06-28 16:17:40,533 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=2095446.0, ans=0.05 2023-06-28 16:18:08,831 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2095506.0, ans=0.1 2023-06-28 16:18:38,827 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=2095626.0, ans=0.0 2023-06-28 16:18:56,444 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.457e+02 7.505e+02 1.008e+03 1.759e+03 5.617e+03, threshold=2.016e+03, percent-clipped=13.0 2023-06-28 16:19:02,073 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=2095686.0, ans=0.0 2023-06-28 16:19:17,298 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=2095746.0, ans=0.0 2023-06-28 16:19:18,268 INFO [train.py:996] (3/4) Epoch 12, batch 13850, loss[loss=0.208, simple_loss=0.3062, pruned_loss=0.05491, over 21717.00 frames. ], tot_loss[loss=0.2145, simple_loss=0.2991, pruned_loss=0.06494, over 4264163.03 frames. ], batch size: 247, lr: 2.42e-03, grad_scale: 8.0 2023-06-28 16:21:05,033 INFO [train.py:996] (3/4) Epoch 12, batch 13900, loss[loss=0.2178, simple_loss=0.293, pruned_loss=0.07129, over 21335.00 frames. ], tot_loss[loss=0.2202, simple_loss=0.3032, pruned_loss=0.06866, over 4265163.94 frames. ], batch size: 176, lr: 2.42e-03, grad_scale: 8.0 2023-06-28 16:21:10,753 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2096046.0, ans=0.1 2023-06-28 16:21:30,951 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.92 vs. limit=15.0 2023-06-28 16:21:48,188 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=2096166.0, ans=0.125 2023-06-28 16:22:20,620 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.853e+02 9.390e+02 1.248e+03 1.935e+03 5.140e+03, threshold=2.497e+03, percent-clipped=23.0 2023-06-28 16:22:26,353 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=2096286.0, ans=0.0 2023-06-28 16:22:35,587 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2.whitening_limit, batch_count=2096286.0, ans=15.0 2023-06-28 16:22:47,370 INFO [train.py:996] (3/4) Epoch 12, batch 13950, loss[loss=0.2303, simple_loss=0.3065, pruned_loss=0.07708, over 21862.00 frames. ], tot_loss[loss=0.2216, simple_loss=0.3027, pruned_loss=0.07025, over 4277777.36 frames. 
], batch size: 414, lr: 2.42e-03, grad_scale: 8.0 2023-06-28 16:22:51,395 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=2096346.0, ans=0.2 2023-06-28 16:22:53,352 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=2096346.0, ans=0.125 2023-06-28 16:23:08,122 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=2096406.0, ans=0.0 2023-06-28 16:23:26,560 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.97 vs. limit=22.5 2023-06-28 16:24:17,395 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=2096586.0, ans=0.125 2023-06-28 16:24:24,697 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=11.12 vs. limit=10.0 2023-06-28 16:24:25,022 INFO [train.py:996] (3/4) Epoch 12, batch 14000, loss[loss=0.1862, simple_loss=0.2875, pruned_loss=0.04248, over 21585.00 frames. ], tot_loss[loss=0.2175, simple_loss=0.2987, pruned_loss=0.06814, over 4283633.45 frames. ], batch size: 230, lr: 2.42e-03, grad_scale: 16.0 2023-06-28 16:24:35,407 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2096646.0, ans=0.1 2023-06-28 16:25:06,469 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=2096706.0, ans=0.0 2023-06-28 16:25:14,415 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2096766.0, ans=0.1 2023-06-28 16:25:43,193 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.00 vs. limit=15.0 2023-06-28 16:25:45,215 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.785e+02 7.339e+02 1.044e+03 1.507e+03 3.234e+03, threshold=2.088e+03, percent-clipped=5.0 2023-06-28 16:25:54,548 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.11 vs. limit=12.0 2023-06-28 16:26:11,280 INFO [train.py:996] (3/4) Epoch 12, batch 14050, loss[loss=0.214, simple_loss=0.2699, pruned_loss=0.07901, over 21299.00 frames. ], tot_loss[loss=0.211, simple_loss=0.2927, pruned_loss=0.06461, over 4276686.98 frames. ], batch size: 471, lr: 2.42e-03, grad_scale: 16.0 2023-06-28 16:27:18,398 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=2097126.0, ans=0.125 2023-06-28 16:27:24,351 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.84 vs. 
limit=15.0 2023-06-28 16:27:48,675 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=2097186.0, ans=0.125 2023-06-28 16:27:50,199 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=2097186.0, ans=0.125 2023-06-28 16:27:52,910 INFO [train.py:996] (3/4) Epoch 12, batch 14100, loss[loss=0.203, simple_loss=0.2734, pruned_loss=0.06629, over 21562.00 frames. ], tot_loss[loss=0.2071, simple_loss=0.2854, pruned_loss=0.06435, over 4267535.68 frames. ], batch size: 230, lr: 2.42e-03, grad_scale: 16.0 2023-06-28 16:27:53,556 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=2097246.0, ans=0.125 2023-06-28 16:28:28,651 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.35 vs. limit=12.0 2023-06-28 16:28:36,104 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=2097366.0, ans=0.125 2023-06-28 16:28:57,287 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=2097426.0, ans=0.2 2023-06-28 16:29:08,545 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.981e+02 8.924e+02 1.261e+03 1.864e+03 4.328e+03, threshold=2.523e+03, percent-clipped=18.0 2023-06-28 16:29:17,116 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=2097486.0, ans=0.0 2023-06-28 16:29:29,312 INFO [train.py:996] (3/4) Epoch 12, batch 14150, loss[loss=0.2059, simple_loss=0.2995, pruned_loss=0.05621, over 21780.00 frames. ], tot_loss[loss=0.2117, simple_loss=0.2906, pruned_loss=0.06638, over 4275056.83 frames. ], batch size: 282, lr: 2.42e-03, grad_scale: 16.0 2023-06-28 16:29:31,894 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=8.41 vs. limit=15.0 2023-06-28 16:30:03,096 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=2097606.0, ans=0.09899494936611666 2023-06-28 16:30:21,784 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=2097666.0, ans=0.125 2023-06-28 16:31:07,772 INFO [train.py:996] (3/4) Epoch 12, batch 14200, loss[loss=0.2003, simple_loss=0.2797, pruned_loss=0.06044, over 21356.00 frames. ], tot_loss[loss=0.2107, simple_loss=0.2906, pruned_loss=0.06542, over 4280739.12 frames. ], batch size: 159, lr: 2.42e-03, grad_scale: 16.0 2023-06-28 16:31:29,336 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.95 vs. limit=15.0 2023-06-28 16:31:32,220 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=2097906.0, ans=0.0 2023-06-28 16:31:39,201 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=14.57 vs. 
limit=15.0 2023-06-28 16:31:48,340 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=2097966.0, ans=0.2 2023-06-28 16:32:13,423 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=2098026.0, ans=0.125 2023-06-28 16:32:21,330 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.921e+02 6.799e+02 8.924e+02 1.241e+03 3.377e+03, threshold=1.785e+03, percent-clipped=4.0 2023-06-28 16:32:47,804 INFO [train.py:996] (3/4) Epoch 12, batch 14250, loss[loss=0.1656, simple_loss=0.2167, pruned_loss=0.05725, over 20838.00 frames. ], tot_loss[loss=0.2075, simple_loss=0.2854, pruned_loss=0.06476, over 4267416.51 frames. ], batch size: 608, lr: 2.42e-03, grad_scale: 16.0 2023-06-28 16:32:53,660 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2098146.0, ans=0.1 2023-06-28 16:33:26,341 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=2098206.0, ans=0.125 2023-06-28 16:33:28,335 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=2098206.0, ans=0.025 2023-06-28 16:33:50,888 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.58 vs. limit=22.5 2023-06-28 16:34:34,308 INFO [train.py:996] (3/4) Epoch 12, batch 14300, loss[loss=0.2846, simple_loss=0.3808, pruned_loss=0.09421, over 21654.00 frames. ], tot_loss[loss=0.2072, simple_loss=0.287, pruned_loss=0.06372, over 4264311.35 frames. ], batch size: 414, lr: 2.42e-03, grad_scale: 16.0 2023-06-28 16:34:55,194 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=2098506.0, ans=0.5 2023-06-28 16:35:05,224 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2098506.0, ans=0.1 2023-06-28 16:35:08,380 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2098506.0, ans=0.1 2023-06-28 16:35:36,505 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=2098566.0, ans=0.0 2023-06-28 16:35:55,445 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.012e+02 8.237e+02 1.255e+03 2.124e+03 4.385e+03, threshold=2.511e+03, percent-clipped=34.0 2023-06-28 16:36:17,109 INFO [train.py:996] (3/4) Epoch 12, batch 14350, loss[loss=0.188, simple_loss=0.2544, pruned_loss=0.06082, over 21303.00 frames. ], tot_loss[loss=0.2128, simple_loss=0.2949, pruned_loss=0.06536, over 4258502.67 frames. ], batch size: 159, lr: 2.42e-03, grad_scale: 16.0 2023-06-28 16:37:12,142 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=2098866.0, ans=0.125 2023-06-28 16:37:59,235 INFO [train.py:996] (3/4) Epoch 12, batch 14400, loss[loss=0.1772, simple_loss=0.2483, pruned_loss=0.05299, over 21859.00 frames. ], tot_loss[loss=0.2112, simple_loss=0.2914, pruned_loss=0.06548, over 4266940.47 frames. 
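The optim.py:471 lines report quartiles of recent gradient norms, the clipping threshold derived from them, and a percent-clipped figure. The exact computation is internal to the optimizer; the sketch below only illustrates the general idea of deriving a clipping threshold from the recent history of gradient norms (the window size of 128 and the use of the median with Clipping_scale=2.0 are assumptions).

import torch

def clip_by_recent_norms(params, recent_norms, clipping_scale=2.0):
    # Total gradient norm for this step.
    grads = [p.grad for p in params if p.grad is not None]
    norm = torch.norm(torch.stack([g.norm() for g in grads])).item()
    # Keep a window of recent norms and derive the threshold from it.
    recent_norms.append(norm)
    if len(recent_norms) > 128:
        recent_norms.pop(0)
    median = torch.tensor(recent_norms).median().item()
    threshold = clipping_scale * median
    clipped = norm > threshold
    if clipped:
        for g in grads:
            g.mul_(threshold / norm)
    # threshold and clipped feed the kind of statistics logged above.
    return threshold, clipped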
], batch size: 98, lr: 2.42e-03, grad_scale: 32.0 2023-06-28 16:38:46,030 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=2099166.0, ans=0.125 2023-06-28 16:39:00,637 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=2099226.0, ans=0.2 2023-06-28 16:39:18,233 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.847e+02 6.927e+02 1.038e+03 1.645e+03 3.908e+03, threshold=2.076e+03, percent-clipped=8.0 2023-06-28 16:39:19,385 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.87 vs. limit=15.0 2023-06-28 16:39:39,677 INFO [train.py:996] (3/4) Epoch 12, batch 14450, loss[loss=0.2122, simple_loss=0.2937, pruned_loss=0.06539, over 21868.00 frames. ], tot_loss[loss=0.2078, simple_loss=0.2853, pruned_loss=0.06519, over 4268056.28 frames. ], batch size: 107, lr: 2.42e-03, grad_scale: 32.0 2023-06-28 16:39:48,356 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=2099346.0, ans=0.0 2023-06-28 16:40:09,780 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.77 vs. limit=15.0 2023-06-28 16:40:27,476 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=2099466.0, ans=0.125 2023-06-28 16:41:23,325 INFO [train.py:996] (3/4) Epoch 12, batch 14500, loss[loss=0.2281, simple_loss=0.3094, pruned_loss=0.07339, over 21593.00 frames. ], tot_loss[loss=0.205, simple_loss=0.281, pruned_loss=0.06447, over 4269552.56 frames. ], batch size: 441, lr: 2.42e-03, grad_scale: 32.0 2023-06-28 16:42:16,714 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=2099766.0, ans=0.125 2023-06-28 16:42:46,536 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.162e+02 7.722e+02 1.013e+03 1.611e+03 2.945e+03, threshold=2.026e+03, percent-clipped=11.0 2023-06-28 16:42:53,717 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-28 16:43:11,634 INFO [train.py:996] (3/4) Epoch 12, batch 14550, loss[loss=0.2103, simple_loss=0.2936, pruned_loss=0.06353, over 21420.00 frames. ], tot_loss[loss=0.2081, simple_loss=0.2851, pruned_loss=0.06554, over 4267771.74 frames. ], batch size: 211, lr: 2.42e-03, grad_scale: 16.0 2023-06-28 16:43:14,020 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=2099946.0, ans=0.0 2023-06-28 16:43:48,992 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.23 vs. limit=15.0 2023-06-28 16:43:50,803 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=8.52 vs. limit=15.0 2023-06-28 16:43:58,038 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.74 vs. 
limit=15.0 2023-06-28 16:44:26,432 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=2100126.0, ans=0.0 2023-06-28 16:44:33,628 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.30 vs. limit=22.5 2023-06-28 16:44:59,739 INFO [train.py:996] (3/4) Epoch 12, batch 14600, loss[loss=0.2312, simple_loss=0.3165, pruned_loss=0.07299, over 21433.00 frames. ], tot_loss[loss=0.2155, simple_loss=0.2923, pruned_loss=0.06941, over 4266349.06 frames. ], batch size: 131, lr: 2.42e-03, grad_scale: 16.0 2023-06-28 16:45:36,859 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=2100366.0, ans=0.0 2023-06-28 16:46:12,061 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.268e+02 8.624e+02 1.300e+03 2.155e+03 4.412e+03, threshold=2.599e+03, percent-clipped=26.0 2023-06-28 16:46:17,851 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=2100486.0, ans=0.0 2023-06-28 16:46:40,691 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=2100546.0, ans=0.125 2023-06-28 16:46:41,589 INFO [train.py:996] (3/4) Epoch 12, batch 14650, loss[loss=0.1832, simple_loss=0.2782, pruned_loss=0.04403, over 21615.00 frames. ], tot_loss[loss=0.2166, simple_loss=0.2957, pruned_loss=0.06873, over 4267247.12 frames. ], batch size: 389, lr: 2.42e-03, grad_scale: 16.0 2023-06-28 16:47:12,975 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=2100606.0, ans=0.125 2023-06-28 16:47:22,296 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=8.64 vs. limit=15.0 2023-06-28 16:48:10,468 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=2100786.0, ans=0.125 2023-06-28 16:48:24,618 INFO [train.py:996] (3/4) Epoch 12, batch 14700, loss[loss=0.1555, simple_loss=0.2335, pruned_loss=0.03877, over 21333.00 frames. ], tot_loss[loss=0.2079, simple_loss=0.2897, pruned_loss=0.06305, over 4262939.78 frames. ], batch size: 131, lr: 2.42e-03, grad_scale: 16.0 2023-06-28 16:48:38,864 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2100846.0, ans=0.1 2023-06-28 16:48:53,909 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2100906.0, ans=0.1 2023-06-28 16:49:17,106 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.04 vs. limit=6.0 2023-06-28 16:49:39,311 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2101026.0, ans=0.1 2023-06-28 16:49:40,143 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.485e+02 7.512e+02 1.036e+03 1.553e+03 3.154e+03, threshold=2.072e+03, percent-clipped=4.0 2023-06-28 16:49:52,876 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.24 vs. 
limit=15.0 2023-06-28 16:50:12,655 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=2101086.0, ans=0.125 2023-06-28 16:50:12,725 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=2101086.0, ans=0.125 2023-06-28 16:50:15,500 INFO [train.py:996] (3/4) Epoch 12, batch 14750, loss[loss=0.3813, simple_loss=0.4406, pruned_loss=0.161, over 21418.00 frames. ], tot_loss[loss=0.2134, simple_loss=0.2958, pruned_loss=0.06548, over 4265183.07 frames. ], batch size: 507, lr: 2.42e-03, grad_scale: 16.0 2023-06-28 16:50:41,775 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=2101206.0, ans=0.2 2023-06-28 16:50:45,553 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=2101206.0, ans=0.125 2023-06-28 16:50:47,622 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.13 vs. limit=22.5 2023-06-28 16:50:57,586 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=2101266.0, ans=0.125 2023-06-28 16:51:07,436 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=2101266.0, ans=0.125 2023-06-28 16:51:28,388 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-28 16:51:49,393 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2101386.0, ans=0.1 2023-06-28 16:51:58,893 INFO [train.py:996] (3/4) Epoch 12, batch 14800, loss[loss=0.2083, simple_loss=0.2863, pruned_loss=0.06514, over 21582.00 frames. ], tot_loss[loss=0.2269, simple_loss=0.31, pruned_loss=0.07197, over 4268481.21 frames. ], batch size: 263, lr: 2.42e-03, grad_scale: 32.0 2023-06-28 16:52:02,735 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=2101446.0, ans=0.04949747468305833 2023-06-28 16:53:10,135 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=2101626.0, ans=0.125 2023-06-28 16:53:24,529 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.184e+02 7.719e+02 1.255e+03 2.135e+03 5.182e+03, threshold=2.510e+03, percent-clipped=29.0 2023-06-28 16:53:32,118 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=2101686.0, ans=0.0 2023-06-28 16:53:37,506 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=2101686.0, ans=0.125 2023-06-28 16:53:39,865 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=8.39 vs. limit=15.0 2023-06-28 16:53:44,731 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2101746.0, ans=0.1 2023-06-28 16:53:50,384 INFO [train.py:996] (3/4) Epoch 12, batch 14850, loss[loss=0.2506, simple_loss=0.3328, pruned_loss=0.08421, over 21570.00 frames. 
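Each scaling.py:182 entry prints the current value (ans) of a ScheduledFloat such as a dropout probability, skip rate or balancer limit, evaluated at the current batch_count. One plausible way to express such a value is a piecewise-linear schedule over batch counts; the breakpoints below are made up for illustration and are not the ones attached to these layers.

def scheduled_float(batch_count: float, schedule) -> float:
    # schedule: list of (batch_count, value) pairs sorted by batch_count.
    # The value is clamped before the first and after the last breakpoint
    # and linearly interpolated in between.
    if batch_count <= schedule[0][0]:
        return schedule[0][1]
    if batch_count >= schedule[-1][0]:
        return schedule[-1][1]
    for (x0, y0), (x1, y1) in zip(schedule, schedule[1:]):
        if x0 <= batch_count <= x1:
            t = (batch_count - x0) / (x1 - x0)
            return y0 + t * (y1 - y0)

# A skip rate that decays from 0.5 to 0.0 over the first 20000 batches and
# then stays at 0.0 would read ans=0.0 at batch_count=2096046.0:
print(scheduled_float(2096046.0, [(0.0, 0.5), (20000.0, 0.0)]))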
], tot_loss[loss=0.2235, simple_loss=0.3041, pruned_loss=0.07139, over 4262447.24 frames. ], batch size: 414, lr: 2.42e-03, grad_scale: 16.0 2023-06-28 16:54:00,915 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2101746.0, ans=0.1 2023-06-28 16:54:06,593 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=2101806.0, ans=0.0 2023-06-28 16:54:48,162 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2101866.0, ans=0.1 2023-06-28 16:54:55,008 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=2101926.0, ans=0.125 2023-06-28 16:54:58,892 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=2101926.0, ans=0.2 2023-06-28 16:55:26,121 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten.whitening_limit, batch_count=2101986.0, ans=15.0 2023-06-28 16:55:27,100 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=2101986.0, ans=0.125 2023-06-28 16:55:34,852 INFO [train.py:996] (3/4) Epoch 12, batch 14900, loss[loss=0.2214, simple_loss=0.2993, pruned_loss=0.07171, over 21365.00 frames. ], tot_loss[loss=0.2255, simple_loss=0.306, pruned_loss=0.07252, over 4265047.88 frames. ], batch size: 194, lr: 2.42e-03, grad_scale: 16.0 2023-06-28 16:55:45,695 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2102046.0, ans=0.1 2023-06-28 16:56:30,558 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=2102166.0, ans=0.0 2023-06-28 16:56:55,256 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.996e+02 9.079e+02 1.317e+03 1.882e+03 4.138e+03, threshold=2.634e+03, percent-clipped=10.0 2023-06-28 16:57:14,110 INFO [train.py:996] (3/4) Epoch 12, batch 14950, loss[loss=0.29, simple_loss=0.3558, pruned_loss=0.1121, over 21386.00 frames. ], tot_loss[loss=0.224, simple_loss=0.3048, pruned_loss=0.07164, over 4268721.83 frames. 
], batch size: 507, lr: 2.42e-03, grad_scale: 16.0 2023-06-28 16:57:34,915 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=2102406.0, ans=0.125 2023-06-28 16:57:51,229 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2102406.0, ans=0.1 2023-06-28 16:58:00,974 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=2102466.0, ans=0.125 2023-06-28 16:58:04,374 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=2102466.0, ans=0.0 2023-06-28 16:58:16,139 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=2102526.0, ans=0.125 2023-06-28 16:58:36,588 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=2102586.0, ans=0.125 2023-06-28 16:58:52,427 INFO [train.py:996] (3/4) Epoch 12, batch 15000, loss[loss=0.2415, simple_loss=0.3173, pruned_loss=0.08288, over 21502.00 frames. ], tot_loss[loss=0.2261, simple_loss=0.3066, pruned_loss=0.07283, over 4271629.45 frames. ], batch size: 211, lr: 2.42e-03, grad_scale: 16.0 2023-06-28 16:58:52,427 INFO [train.py:1019] (3/4) Computing validation loss 2023-06-28 16:59:11,973 INFO [train.py:1028] (3/4) Epoch 12, validation: loss=0.2573, simple_loss=0.3458, pruned_loss=0.08437, over 1796401.00 frames. 2023-06-28 16:59:11,974 INFO [train.py:1029] (3/4) Maximum memory allocated so far is 23690MB 2023-06-28 16:59:52,030 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=2102766.0, ans=0.125 2023-06-28 17:00:15,286 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-28 17:00:28,470 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.412e+02 7.471e+02 1.040e+03 1.542e+03 3.461e+03, threshold=2.079e+03, percent-clipped=2.0 2023-06-28 17:00:57,512 INFO [train.py:996] (3/4) Epoch 12, batch 15050, loss[loss=0.2064, simple_loss=0.2923, pruned_loss=0.0602, over 21658.00 frames. ], tot_loss[loss=0.2263, simple_loss=0.3066, pruned_loss=0.07301, over 4272577.30 frames. ], batch size: 247, lr: 2.42e-03, grad_scale: 16.0 2023-06-28 17:01:01,844 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=2102946.0, ans=0.0 2023-06-28 17:01:02,431 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.13 vs. limit=15.0 2023-06-28 17:02:22,517 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=2103186.0, ans=0.125 2023-06-28 17:02:22,987 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=5.67 vs. 
limit=22.5 2023-06-28 17:02:26,117 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=2103186.0, ans=0.125 2023-06-28 17:02:40,894 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=2103186.0, ans=0.125 2023-06-28 17:02:41,385 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.62 vs. limit=10.0 2023-06-28 17:02:44,572 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=2103246.0, ans=0.2 2023-06-28 17:02:45,489 INFO [train.py:996] (3/4) Epoch 12, batch 15100, loss[loss=0.2471, simple_loss=0.3312, pruned_loss=0.08144, over 21594.00 frames. ], tot_loss[loss=0.226, simple_loss=0.3083, pruned_loss=0.07183, over 4273436.65 frames. ], batch size: 389, lr: 2.42e-03, grad_scale: 16.0 2023-06-28 17:04:04,841 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.134e+02 7.864e+02 1.140e+03 1.681e+03 3.504e+03, threshold=2.280e+03, percent-clipped=13.0 2023-06-28 17:04:24,884 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=2103486.0, ans=0.125 2023-06-28 17:04:27,707 INFO [train.py:996] (3/4) Epoch 12, batch 15150, loss[loss=0.2389, simple_loss=0.294, pruned_loss=0.09191, over 21256.00 frames. ], tot_loss[loss=0.2233, simple_loss=0.303, pruned_loss=0.07185, over 4276308.61 frames. ], batch size: 471, lr: 2.42e-03, grad_scale: 16.0 2023-06-28 17:05:17,494 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=2103666.0, ans=0.0 2023-06-28 17:05:29,399 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.10 vs. limit=22.5 2023-06-28 17:05:59,642 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-28 17:06:09,569 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2103846.0, ans=0.1 2023-06-28 17:06:10,665 INFO [train.py:996] (3/4) Epoch 12, batch 15200, loss[loss=0.1964, simple_loss=0.2929, pruned_loss=0.05001, over 21549.00 frames. ], tot_loss[loss=0.2152, simple_loss=0.2934, pruned_loss=0.06856, over 4264376.99 frames. ], batch size: 389, lr: 2.42e-03, grad_scale: 32.0 2023-06-28 17:06:26,560 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=2103906.0, ans=0.04949747468305833 2023-06-28 17:06:44,553 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=2103906.0, ans=0.125 2023-06-28 17:06:49,398 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=2103966.0, ans=0.125 2023-06-28 17:07:16,547 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=2104026.0, ans=0.2 2023-06-28 17:07:18,887 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=5.50 vs. 
limit=12.0 2023-06-28 17:07:20,609 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.65 vs. limit=10.0 2023-06-28 17:07:34,246 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.768e+02 7.003e+02 9.713e+02 1.349e+03 2.577e+03, threshold=1.943e+03, percent-clipped=4.0 2023-06-28 17:07:52,326 INFO [train.py:996] (3/4) Epoch 12, batch 15250, loss[loss=0.1765, simple_loss=0.2538, pruned_loss=0.04964, over 21665.00 frames. ], tot_loss[loss=0.2099, simple_loss=0.2866, pruned_loss=0.06654, over 4262546.34 frames. ], batch size: 332, lr: 2.42e-03, grad_scale: 16.0 2023-06-28 17:07:57,813 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=2104146.0, ans=0.125 2023-06-28 17:08:02,786 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=2104146.0, ans=0.04949747468305833 2023-06-28 17:08:26,974 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.48 vs. limit=15.0 2023-06-28 17:08:55,174 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=2104326.0, ans=0.0 2023-06-28 17:09:34,114 INFO [train.py:996] (3/4) Epoch 12, batch 15300, loss[loss=0.2454, simple_loss=0.3152, pruned_loss=0.08781, over 21795.00 frames. ], tot_loss[loss=0.2139, simple_loss=0.2891, pruned_loss=0.06938, over 4269954.42 frames. ], batch size: 441, lr: 2.42e-03, grad_scale: 16.0 2023-06-28 17:09:36,329 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=2104446.0, ans=0.0 2023-06-28 17:09:53,039 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=2104446.0, ans=0.95 2023-06-28 17:09:56,861 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.53 vs. limit=10.0 2023-06-28 17:10:16,301 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-28 17:10:31,870 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2104566.0, ans=0.1 2023-06-28 17:10:56,994 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=2104626.0, ans=0.5 2023-06-28 17:11:01,319 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.828e+02 9.652e+02 1.202e+03 1.838e+03 3.602e+03, threshold=2.404e+03, percent-clipped=24.0 2023-06-28 17:11:12,101 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.82 vs. limit=22.5 2023-06-28 17:11:14,895 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=2104686.0, ans=0.0 2023-06-28 17:11:17,462 INFO [train.py:996] (3/4) Epoch 12, batch 15350, loss[loss=0.2146, simple_loss=0.3217, pruned_loss=0.0538, over 21621.00 frames. ], tot_loss[loss=0.2192, simple_loss=0.2955, pruned_loss=0.07145, over 4273132.50 frames. 
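At Epoch 12, batch 15000 a few entries earlier, training pauses to compute a validation loss over the dev set (1796401 frames) and then reports the peak GPU memory allocated so far. A frame-weighted validation loop of roughly that shape is sketched below; model, criterion, valid_dl and the batch keys are placeholders, and the real routine in train.py also reports simple_loss and pruned_loss separately.

import torch

@torch.no_grad()
def compute_validation_loss(model, criterion, valid_dl, device):
    model.eval()
    loss_sum, frames = 0.0, 0.0
    for batch in valid_dl:
        feats = batch["features"].to(device)       # placeholder batch layout
        targets = batch["targets"].to(device)
        num_frames = float(batch["num_frames"])
        loss = criterion(model(feats), targets)    # assumed per-frame loss
        loss_sum += loss.item() * num_frames
        frames += num_frames
    model.train()
    max_mem_mb = torch.cuda.max_memory_allocated(device) // (1024 * 1024)
    print(f"validation: loss={loss_sum / frames:.4f}, over {frames:.2f} frames.")
    print(f"Maximum memory allocated so far is {max_mem_mb}MB")
    return loss_sum / frames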
], batch size: 263, lr: 2.42e-03, grad_scale: 16.0 2023-06-28 17:11:32,535 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=2104746.0, ans=0.0 2023-06-28 17:11:50,699 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=2104806.0, ans=0.125 2023-06-28 17:12:56,970 INFO [train.py:996] (3/4) Epoch 12, batch 15400, loss[loss=0.1895, simple_loss=0.2755, pruned_loss=0.05177, over 21863.00 frames. ], tot_loss[loss=0.2167, simple_loss=0.2948, pruned_loss=0.06934, over 4261457.15 frames. ], batch size: 332, lr: 2.42e-03, grad_scale: 16.0 2023-06-28 17:13:04,542 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=2105046.0, ans=0.0 2023-06-28 17:14:16,124 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.384e+02 7.580e+02 1.010e+03 1.519e+03 4.001e+03, threshold=2.021e+03, percent-clipped=6.0 2023-06-28 17:14:30,301 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=2105286.0, ans=0.0 2023-06-28 17:14:37,991 INFO [train.py:996] (3/4) Epoch 12, batch 15450, loss[loss=0.1993, simple_loss=0.2743, pruned_loss=0.06218, over 21473.00 frames. ], tot_loss[loss=0.2153, simple_loss=0.293, pruned_loss=0.06876, over 4254564.50 frames. ], batch size: 131, lr: 2.42e-03, grad_scale: 16.0 2023-06-28 17:14:38,644 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=2105346.0, ans=0.125 2023-06-28 17:15:16,795 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=2105466.0, ans=0.0 2023-06-28 17:15:23,007 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=2105466.0, ans=0.125 2023-06-28 17:16:20,921 INFO [train.py:996] (3/4) Epoch 12, batch 15500, loss[loss=0.2575, simple_loss=0.3376, pruned_loss=0.08871, over 21745.00 frames. ], tot_loss[loss=0.2159, simple_loss=0.2953, pruned_loss=0.06828, over 4255301.33 frames. ], batch size: 441, lr: 2.42e-03, grad_scale: 16.0 2023-06-28 17:16:32,009 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.17 vs. limit=6.0 2023-06-28 17:17:20,910 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=2105766.0, ans=10.0 2023-06-28 17:17:24,379 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2105826.0, ans=0.1 2023-06-28 17:17:46,454 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.645e+02 8.205e+02 1.251e+03 1.746e+03 3.424e+03, threshold=2.502e+03, percent-clipped=13.0 2023-06-28 17:17:57,389 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.68 vs. limit=10.0 2023-06-28 17:18:07,375 INFO [train.py:996] (3/4) Epoch 12, batch 15550, loss[loss=0.1914, simple_loss=0.28, pruned_loss=0.05146, over 21819.00 frames. ], tot_loss[loss=0.2147, simple_loss=0.296, pruned_loss=0.06666, over 4256016.68 frames. 
], batch size: 316, lr: 2.42e-03, grad_scale: 16.0 2023-06-28 17:19:13,299 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=2106126.0, ans=0.125 2023-06-28 17:19:14,872 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer_ff3.min_abs, batch_count=2106126.0, ans=0.2 2023-06-28 17:19:31,452 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2106186.0, ans=0.1 2023-06-28 17:19:49,227 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2106246.0, ans=0.1 2023-06-28 17:19:49,314 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=2106246.0, ans=0.0 2023-06-28 17:19:50,307 INFO [train.py:996] (3/4) Epoch 12, batch 15600, loss[loss=0.2193, simple_loss=0.3088, pruned_loss=0.06486, over 21505.00 frames. ], tot_loss[loss=0.2097, simple_loss=0.2892, pruned_loss=0.06517, over 4253398.13 frames. ], batch size: 389, lr: 2.42e-03, grad_scale: 32.0 2023-06-28 17:19:56,383 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.15 vs. limit=15.0 2023-06-28 17:20:11,618 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=2106306.0, ans=0.0 2023-06-28 17:20:13,535 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=2106306.0, ans=0.125 2023-06-28 17:20:13,619 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=2106306.0, ans=0.0 2023-06-28 17:20:28,274 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.75 vs. limit=15.0 2023-06-28 17:21:08,511 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.963e+02 9.239e+02 1.318e+03 1.838e+03 4.350e+03, threshold=2.636e+03, percent-clipped=8.0 2023-06-28 17:21:12,365 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=2106486.0, ans=0.125 2023-06-28 17:21:29,748 INFO [train.py:996] (3/4) Epoch 12, batch 15650, loss[loss=0.2291, simple_loss=0.2856, pruned_loss=0.08635, over 21275.00 frames. ], tot_loss[loss=0.2092, simple_loss=0.2887, pruned_loss=0.0649, over 4252553.35 frames. ], batch size: 471, lr: 2.42e-03, grad_scale: 16.0 2023-06-28 17:21:38,467 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=2106546.0, ans=0.025 2023-06-28 17:21:38,895 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=5.07 vs. limit=12.0 2023-06-28 17:22:50,488 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.82 vs. 
limit=15.0 2023-06-28 17:22:53,448 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=2106786.0, ans=0.0 2023-06-28 17:23:08,761 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.76 vs. limit=15.0 2023-06-28 17:23:12,642 INFO [train.py:996] (3/4) Epoch 12, batch 15700, loss[loss=0.195, simple_loss=0.2667, pruned_loss=0.06169, over 21249.00 frames. ], tot_loss[loss=0.2057, simple_loss=0.284, pruned_loss=0.06364, over 4258485.70 frames. ], batch size: 143, lr: 2.42e-03, grad_scale: 16.0 2023-06-28 17:23:43,871 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.00 vs. limit=15.0 2023-06-28 17:24:30,014 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=8.59 vs. limit=10.0 2023-06-28 17:24:39,909 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.792e+02 8.496e+02 1.514e+03 2.181e+03 4.345e+03, threshold=3.028e+03, percent-clipped=16.0 2023-06-28 17:24:42,597 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.86 vs. limit=15.0 2023-06-28 17:24:54,651 INFO [train.py:996] (3/4) Epoch 12, batch 15750, loss[loss=0.1937, simple_loss=0.2717, pruned_loss=0.05789, over 21885.00 frames. ], tot_loss[loss=0.2047, simple_loss=0.2812, pruned_loss=0.06407, over 4262657.51 frames. ], batch size: 107, lr: 2.42e-03, grad_scale: 16.0 2023-06-28 17:25:24,540 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=9.66 vs. limit=22.5 2023-06-28 17:26:14,964 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=2107386.0, ans=0.0 2023-06-28 17:26:35,171 INFO [train.py:996] (3/4) Epoch 12, batch 15800, loss[loss=0.196, simple_loss=0.2609, pruned_loss=0.06559, over 21316.00 frames. ], tot_loss[loss=0.2027, simple_loss=0.2771, pruned_loss=0.06412, over 4252400.32 frames. ], batch size: 159, lr: 2.42e-03, grad_scale: 16.0 2023-06-28 17:26:57,183 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=2107446.0, ans=0.0 2023-06-28 17:28:01,456 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.867e+02 7.167e+02 8.955e+02 1.687e+03 3.256e+03, threshold=1.791e+03, percent-clipped=1.0 2023-06-28 17:28:16,331 INFO [train.py:996] (3/4) Epoch 12, batch 15850, loss[loss=0.256, simple_loss=0.3201, pruned_loss=0.09596, over 21367.00 frames. ], tot_loss[loss=0.2056, simple_loss=0.2794, pruned_loss=0.06594, over 4260771.82 frames. ], batch size: 471, lr: 2.42e-03, grad_scale: 16.0 2023-06-28 17:28:37,397 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.18 vs. limit=6.0 2023-06-28 17:29:53,386 INFO [train.py:996] (3/4) Epoch 12, batch 15900, loss[loss=0.2022, simple_loss=0.2855, pruned_loss=0.05941, over 21477.00 frames. ], tot_loss[loss=0.2037, simple_loss=0.2762, pruned_loss=0.06561, over 4261292.21 frames. 
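The scaling.py:962 Whitening entries compare a per-module whitening metric against a limit (10.0, 12.0, 15.0, 22.5 above); the metric stays near 1 when a module's activations have an approximately isotropic covariance and grows as the channels become more correlated. One plausible definition of such a metric, not necessarily the one used in scaling.py, is the mean squared eigenvalue of the covariance divided by the squared mean eigenvalue:

import torch

def whitening_metric(x: torch.Tensor) -> float:
    # x: (num_frames, num_channels) activations for one module.
    # Returns 1.0 for a perfectly white (isotropic) covariance, larger otherwise.
    x = x - x.mean(dim=0, keepdim=True)
    num_frames, num_channels = x.shape
    cov = (x.t() @ x) / num_frames                        # channel covariance
    mean_sq_eig = torch.trace(cov @ cov) / num_channels   # mean of eigenvalues squared
    sq_mean_eig = (torch.trace(cov) / num_channels) ** 2  # square of the mean eigenvalue
    return (mean_sq_eig / sq_mean_eig).item()

x = torch.randn(1000, 256)     # roughly white activations
print(whitening_metric(x))     # close to 1, i.e. well under a limit like 15.0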
], batch size: 389, lr: 2.42e-03, grad_scale: 16.0 2023-06-28 17:30:55,518 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=4.01 vs. limit=12.0 2023-06-28 17:31:00,575 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.44 vs. limit=10.0 2023-06-28 17:31:15,599 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.841e+02 7.381e+02 9.815e+02 1.486e+03 2.540e+03, threshold=1.963e+03, percent-clipped=11.0 2023-06-28 17:31:19,180 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=2108286.0, ans=0.0 2023-06-28 17:31:20,732 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=2108286.0, ans=0.2 2023-06-28 17:31:31,772 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=2108286.0, ans=0.125 2023-06-28 17:31:34,466 INFO [train.py:996] (3/4) Epoch 12, batch 15950, loss[loss=0.1851, simple_loss=0.2833, pruned_loss=0.04341, over 21673.00 frames. ], tot_loss[loss=0.2034, simple_loss=0.279, pruned_loss=0.06386, over 4244315.57 frames. ], batch size: 389, lr: 2.42e-03, grad_scale: 16.0 2023-06-28 17:31:40,076 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=2108346.0, ans=0.0 2023-06-28 17:32:33,911 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=2108466.0, ans=0.0 2023-06-28 17:32:47,107 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=2108526.0, ans=0.125 2023-06-28 17:32:53,275 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2108586.0, ans=0.1 2023-06-28 17:33:11,196 INFO [train.py:996] (3/4) Epoch 12, batch 16000, loss[loss=0.2077, simple_loss=0.3049, pruned_loss=0.05519, over 21655.00 frames. ], tot_loss[loss=0.2023, simple_loss=0.2811, pruned_loss=0.06174, over 4260152.80 frames. ], batch size: 389, lr: 2.42e-03, grad_scale: 16.0 2023-06-28 17:33:19,536 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=2108646.0, ans=0.125 2023-06-28 17:33:42,528 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=2108706.0, ans=0.0 2023-06-28 17:33:46,273 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.94 vs. limit=12.0 2023-06-28 17:34:27,051 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=2108826.0, ans=0.125 2023-06-28 17:34:39,581 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.641e+02 6.614e+02 9.934e+02 1.443e+03 3.349e+03, threshold=1.987e+03, percent-clipped=8.0 2023-06-28 17:34:51,955 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=2108946.0, ans=0.125 2023-06-28 17:34:52,874 INFO [train.py:996] (3/4) Epoch 12, batch 16050, loss[loss=0.2432, simple_loss=0.3498, pruned_loss=0.06826, over 21844.00 frames. 
], tot_loss[loss=0.202, simple_loss=0.2837, pruned_loss=0.06015, over 4266564.22 frames. ], batch size: 371, lr: 2.42e-03, grad_scale: 16.0 2023-06-28 17:35:37,086 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-28 17:35:39,072 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.33 vs. limit=15.0 2023-06-28 17:36:22,579 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=6.43 vs. limit=12.0 2023-06-28 17:36:28,134 INFO [train.py:996] (3/4) Epoch 12, batch 16100, loss[loss=0.2123, simple_loss=0.2916, pruned_loss=0.06654, over 21880.00 frames. ], tot_loss[loss=0.2058, simple_loss=0.2886, pruned_loss=0.06151, over 4274632.67 frames. ], batch size: 351, lr: 2.42e-03, grad_scale: 16.0 2023-06-28 17:36:28,650 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2109246.0, ans=0.1 2023-06-28 17:36:36,552 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=2109246.0, ans=0.035 2023-06-28 17:36:41,698 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.min_positive, batch_count=2109246.0, ans=0.05 2023-06-28 17:37:30,597 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=2109426.0, ans=0.125 2023-06-28 17:37:45,027 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=2109426.0, ans=0.0 2023-06-28 17:37:51,679 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=2109486.0, ans=0.0 2023-06-28 17:37:52,814 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.516e+02 1.028e+03 1.550e+03 2.496e+03 6.023e+03, threshold=3.100e+03, percent-clipped=39.0 2023-06-28 17:37:55,115 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2109486.0, ans=0.1 2023-06-28 17:38:05,316 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=2109546.0, ans=0.05 2023-06-28 17:38:06,335 INFO [train.py:996] (3/4) Epoch 12, batch 16150, loss[loss=0.2201, simple_loss=0.2985, pruned_loss=0.0709, over 21485.00 frames. ], tot_loss[loss=0.2078, simple_loss=0.2882, pruned_loss=0.06373, over 4281448.36 frames. ], batch size: 131, lr: 2.42e-03, grad_scale: 16.0 2023-06-28 17:38:58,868 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=2109666.0, ans=0.0 2023-06-28 17:39:49,879 INFO [train.py:996] (3/4) Epoch 12, batch 16200, loss[loss=0.2522, simple_loss=0.332, pruned_loss=0.08619, over 21503.00 frames. ], tot_loss[loss=0.211, simple_loss=0.2928, pruned_loss=0.06457, over 4281607.18 frames. 
], batch size: 471, lr: 2.42e-03, grad_scale: 16.0 2023-06-28 17:40:30,942 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=2109906.0, ans=0.125 2023-06-28 17:40:55,959 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2110026.0, ans=0.1 2023-06-28 17:40:56,690 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=3.46 vs. limit=12.0 2023-06-28 17:41:21,145 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.167e+02 9.228e+02 1.472e+03 2.186e+03 5.217e+03, threshold=2.943e+03, percent-clipped=8.0 2023-06-28 17:41:30,429 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=2110086.0, ans=0.0 2023-06-28 17:41:39,807 INFO [train.py:996] (3/4) Epoch 12, batch 16250, loss[loss=0.1635, simple_loss=0.2418, pruned_loss=0.04257, over 21408.00 frames. ], tot_loss[loss=0.2116, simple_loss=0.2935, pruned_loss=0.06485, over 4279342.67 frames. ], batch size: 211, lr: 2.42e-03, grad_scale: 16.0 2023-06-28 17:41:40,209 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=2110146.0, ans=0.0 2023-06-28 17:42:11,798 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=2110206.0, ans=0.125 2023-06-28 17:42:33,613 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=2110266.0, ans=0.0 2023-06-28 17:42:55,560 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=2110386.0, ans=0.125 2023-06-28 17:43:22,707 INFO [train.py:996] (3/4) Epoch 12, batch 16300, loss[loss=0.1734, simple_loss=0.2603, pruned_loss=0.04326, over 21271.00 frames. ], tot_loss[loss=0.2052, simple_loss=0.2868, pruned_loss=0.06176, over 4279080.55 frames. ], batch size: 176, lr: 2.42e-03, grad_scale: 16.0 2023-06-28 17:43:38,010 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=2110446.0, ans=0.0 2023-06-28 17:44:17,558 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=2110566.0, ans=0.125 2023-06-28 17:44:21,771 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=15.95 vs. limit=15.0 2023-06-28 17:44:47,671 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.729e+02 7.899e+02 1.103e+03 1.681e+03 3.393e+03, threshold=2.206e+03, percent-clipped=5.0 2023-06-28 17:44:59,815 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2110686.0, ans=0.1 2023-06-28 17:45:06,098 INFO [train.py:996] (3/4) Epoch 12, batch 16350, loss[loss=0.1935, simple_loss=0.2701, pruned_loss=0.0584, over 21546.00 frames. ], tot_loss[loss=0.2056, simple_loss=0.2859, pruned_loss=0.06267, over 4281969.72 frames. 
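The grad_scale value in the batch headers moves between 8.0, 16.0 and 32.0 over this stretch of training; that is the dynamic loss-scaling factor used for mixed-precision training, which grows after a run of stable steps and shrinks when non-finite gradients are detected. A minimal AMP training step with PyTorch's GradScaler, with model, optimizer, loss_fn and the batch layout as placeholders, looks like this:

import torch
from torch.cuda.amp import GradScaler, autocast

scaler = GradScaler()  # starts from a default scale and adapts it over time

def train_step(model, optimizer, loss_fn, batch):
    optimizer.zero_grad()
    with autocast():                       # forward pass in fp16 where safe
        loss = loss_fn(model(batch["inputs"]), batch["targets"])
    scaler.scale(loss).backward()          # backward on the scaled loss
    scaler.step(optimizer)                 # unscales grads; skips the step on inf/nan
    scaler.update()                        # grows or shrinks the scale (cf. grad_scale)
    return loss.detach()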
], batch size: 441, lr: 2.41e-03, grad_scale: 16.0 2023-06-28 17:45:21,763 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=2110746.0, ans=0.125 2023-06-28 17:45:58,177 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=2110866.0, ans=0.05 2023-06-28 17:46:20,316 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.34 vs. limit=22.5 2023-06-28 17:46:51,278 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=2110986.0, ans=0.2 2023-06-28 17:46:53,875 INFO [train.py:996] (3/4) Epoch 12, batch 16400, loss[loss=0.2284, simple_loss=0.3035, pruned_loss=0.07666, over 21827.00 frames. ], tot_loss[loss=0.208, simple_loss=0.2885, pruned_loss=0.06374, over 4285271.42 frames. ], batch size: 124, lr: 2.41e-03, grad_scale: 32.0 2023-06-28 17:47:06,548 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=2111046.0, ans=0.125 2023-06-28 17:47:10,071 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2111106.0, ans=0.1 2023-06-28 17:47:18,679 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=2111106.0, ans=0.2 2023-06-28 17:48:06,700 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=2111226.0, ans=0.125 2023-06-28 17:48:16,143 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.551e+02 7.002e+02 9.291e+02 1.321e+03 2.557e+03, threshold=1.858e+03, percent-clipped=4.0 2023-06-28 17:48:25,218 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.20 vs. limit=15.0 2023-06-28 17:48:37,400 INFO [train.py:996] (3/4) Epoch 12, batch 16450, loss[loss=0.1916, simple_loss=0.2655, pruned_loss=0.05888, over 21910.00 frames. ], tot_loss[loss=0.2089, simple_loss=0.288, pruned_loss=0.06491, over 4292993.36 frames. ], batch size: 316, lr: 2.41e-03, grad_scale: 16.0 2023-06-28 17:48:41,383 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=2111346.0, ans=0.125 2023-06-28 17:48:46,101 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=2111346.0, ans=0.125 2023-06-28 17:48:48,387 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten.whitening_limit, batch_count=2111346.0, ans=15.0 2023-06-28 17:49:02,703 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=2111406.0, ans=0.125 2023-06-28 17:49:06,094 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=2111406.0, ans=0.2 2023-06-28 17:50:14,303 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-28 17:50:20,641 INFO [train.py:996] (3/4) Epoch 12, batch 16500, loss[loss=0.2134, simple_loss=0.3143, pruned_loss=0.05632, over 21215.00 frames. 
], tot_loss[loss=0.2096, simple_loss=0.2882, pruned_loss=0.06552, over 4298963.22 frames. ], batch size: 548, lr: 2.41e-03, grad_scale: 16.0 2023-06-28 17:50:28,728 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=8.41 vs. limit=10.0 2023-06-28 17:50:56,390 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=2111706.0, ans=0.125 2023-06-28 17:51:52,101 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.183e+02 7.579e+02 1.164e+03 1.772e+03 4.926e+03, threshold=2.328e+03, percent-clipped=21.0 2023-06-28 17:51:56,324 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-28 17:52:09,282 INFO [train.py:996] (3/4) Epoch 12, batch 16550, loss[loss=0.2318, simple_loss=0.3253, pruned_loss=0.06917, over 21494.00 frames. ], tot_loss[loss=0.208, simple_loss=0.2886, pruned_loss=0.06364, over 4296445.81 frames. ], batch size: 471, lr: 2.41e-03, grad_scale: 16.0 2023-06-28 17:52:19,142 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=6.77 vs. limit=15.0 2023-06-28 17:52:26,801 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=2112006.0, ans=0.125 2023-06-28 17:52:40,364 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=2112006.0, ans=0.07 2023-06-28 17:53:08,530 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=2112066.0, ans=0.04949747468305833 2023-06-28 17:53:45,531 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=2112186.0, ans=0.125 2023-06-28 17:53:50,518 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2112186.0, ans=0.1 2023-06-28 17:53:54,957 INFO [train.py:996] (3/4) Epoch 12, batch 16600, loss[loss=0.3119, simple_loss=0.4046, pruned_loss=0.1096, over 21674.00 frames. ], tot_loss[loss=0.2158, simple_loss=0.2973, pruned_loss=0.06712, over 4293097.82 frames. ], batch size: 441, lr: 2.41e-03, grad_scale: 16.0 2023-06-28 17:54:27,827 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=2112306.0, ans=0.0 2023-06-28 17:54:34,482 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer_ff3.min_abs, batch_count=2112306.0, ans=0.2 2023-06-28 17:54:59,515 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=2112426.0, ans=10.0 2023-06-28 17:55:19,682 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2112426.0, ans=0.1 2023-06-28 17:55:27,795 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 6.049e+02 7.685e+02 9.523e+02 1.400e+03 3.440e+03, threshold=1.905e+03, percent-clipped=5.0 2023-06-28 17:55:40,080 INFO [train.py:996] (3/4) Epoch 12, batch 16650, loss[loss=0.2318, simple_loss=0.3135, pruned_loss=0.07501, over 22018.00 frames. ], tot_loss[loss=0.2233, simple_loss=0.3076, pruned_loss=0.06951, over 4292803.68 frames. 
], batch size: 317, lr: 2.41e-03, grad_scale: 16.0 2023-06-28 17:56:07,210 INFO [scaling.py:962] (3/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=5.45 vs. limit=8.0 2023-06-28 17:56:16,714 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-28 17:56:35,749 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=12.89 vs. limit=15.0 2023-06-28 17:56:40,644 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=2112666.0, ans=0.0 2023-06-28 17:57:35,420 INFO [train.py:996] (3/4) Epoch 12, batch 16700, loss[loss=0.2305, simple_loss=0.3341, pruned_loss=0.06342, over 21179.00 frames. ], tot_loss[loss=0.2255, simple_loss=0.3097, pruned_loss=0.0706, over 4287209.59 frames. ], batch size: 549, lr: 2.41e-03, grad_scale: 16.0 2023-06-28 17:57:36,177 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=2112846.0, ans=0.125 2023-06-28 17:58:25,359 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=2112966.0, ans=0.0 2023-06-28 17:58:36,260 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.81 vs. limit=22.5 2023-06-28 17:58:46,210 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2113026.0, ans=0.1 2023-06-28 17:59:08,945 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.823e+02 8.945e+02 1.338e+03 1.942e+03 4.278e+03, threshold=2.675e+03, percent-clipped=28.0 2023-06-28 17:59:13,914 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.57 vs. limit=10.0 2023-06-28 17:59:26,624 INFO [train.py:996] (3/4) Epoch 12, batch 16750, loss[loss=0.2401, simple_loss=0.3248, pruned_loss=0.07771, over 21775.00 frames. ], tot_loss[loss=0.2268, simple_loss=0.3097, pruned_loss=0.07189, over 4280821.68 frames. ], batch size: 332, lr: 2.41e-03, grad_scale: 16.0 2023-06-28 17:59:33,192 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=2113146.0, ans=0.2 2023-06-28 17:59:52,292 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2113206.0, ans=0.1 2023-06-28 18:00:35,632 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=2113326.0, ans=10.0 2023-06-28 18:00:44,434 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=6.17 vs. limit=15.0 2023-06-28 18:01:11,590 INFO [train.py:996] (3/4) Epoch 12, batch 16800, loss[loss=0.2028, simple_loss=0.2739, pruned_loss=0.06587, over 21769.00 frames. ], tot_loss[loss=0.2285, simple_loss=0.314, pruned_loss=0.07151, over 4283184.31 frames. 
], batch size: 247, lr: 2.41e-03, grad_scale: 32.0 2023-06-28 18:01:12,237 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=2113446.0, ans=0.0 2023-06-28 18:01:37,320 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=2113506.0, ans=0.0 2023-06-28 18:02:00,057 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=2113566.0, ans=0.125 2023-06-28 18:02:44,385 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.630e+02 9.200e+02 1.390e+03 2.563e+03 4.897e+03, threshold=2.780e+03, percent-clipped=19.0 2023-06-28 18:02:58,983 INFO [train.py:996] (3/4) Epoch 12, batch 16850, loss[loss=0.2197, simple_loss=0.2895, pruned_loss=0.07494, over 21868.00 frames. ], tot_loss[loss=0.226, simple_loss=0.3088, pruned_loss=0.07162, over 4291584.36 frames. ], batch size: 371, lr: 2.41e-03, grad_scale: 16.0 2023-06-28 18:03:16,462 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=2113806.0, ans=0.125 2023-06-28 18:03:16,983 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.16 vs. limit=15.0 2023-06-28 18:03:49,393 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.39 vs. limit=15.0 2023-06-28 18:03:55,714 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=2113866.0, ans=0.125 2023-06-28 18:04:40,766 INFO [train.py:996] (3/4) Epoch 12, batch 16900, loss[loss=0.1882, simple_loss=0.2663, pruned_loss=0.05511, over 21785.00 frames. ], tot_loss[loss=0.2213, simple_loss=0.303, pruned_loss=0.0698, over 4290655.08 frames. 
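Editorial note: the grad_scale field above moves between 32.0 (batch 16800), 16.0 (batch 16850) and 8.0 (from batch 16950). That pattern is consistent with dynamic loss scaling for mixed-precision training, where the scale is halved after an overflow and grown again after a run of clean steps. Below is a minimal sketch of that generic torch.cuda.amp pattern, not this recipe's actual training loop; model, optimizer, batch and criterion are placeholders.

# Sketch of dynamic loss scaling with torch.cuda.amp (illustrative only).
import torch

scaler = torch.cuda.amp.GradScaler(init_scale=16.0, growth_interval=2000)

def train_step(model, optimizer, batch, criterion):
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():
        loss = criterion(model(batch["inputs"]), batch["targets"])
    scaler.scale(loss).backward()   # backward on the scaled loss
    scaler.step(optimizer)          # skips the update if gradients overflowed
    scaler.update()                 # halve on overflow, grow after clean steps
    return loss.detach(), scaler.get_scale()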
], batch size: 316, lr: 2.41e-03, grad_scale: 16.0 2023-06-28 18:05:07,444 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=2114106.0, ans=0.125 2023-06-28 18:05:13,796 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=2114166.0, ans=0.125 2023-06-28 18:05:32,853 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=2114166.0, ans=0.2 2023-06-28 18:05:34,411 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=2114166.0, ans=0.05 2023-06-28 18:05:37,932 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=2114226.0, ans=0.2 2023-06-28 18:05:59,506 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=2114286.0, ans=0.125 2023-06-28 18:05:59,523 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=2114286.0, ans=0.0 2023-06-28 18:06:02,751 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=2114286.0, ans=0.2 2023-06-28 18:06:08,630 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.231e+02 8.316e+02 1.157e+03 1.734e+03 4.199e+03, threshold=2.313e+03, percent-clipped=8.0 2023-06-28 18:06:21,744 INFO [train.py:996] (3/4) Epoch 12, batch 16950, loss[loss=0.2084, simple_loss=0.2866, pruned_loss=0.06511, over 21422.00 frames. ], tot_loss[loss=0.2157, simple_loss=0.2954, pruned_loss=0.06804, over 4286665.93 frames. ], batch size: 131, lr: 2.41e-03, grad_scale: 8.0 2023-06-28 18:06:38,713 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=2114406.0, ans=0.125 2023-06-28 18:06:47,172 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=2114406.0, ans=0.2 2023-06-28 18:07:02,291 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=2114466.0, ans=0.125 2023-06-28 18:07:37,276 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=2114586.0, ans=0.0 2023-06-28 18:07:59,334 INFO [train.py:996] (3/4) Epoch 12, batch 17000, loss[loss=0.2349, simple_loss=0.3077, pruned_loss=0.08105, over 21876.00 frames. ], tot_loss[loss=0.2145, simple_loss=0.2914, pruned_loss=0.06876, over 4289806.32 frames. 
], batch size: 118, lr: 2.41e-03, grad_scale: 8.0 2023-06-28 18:08:11,649 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=2114646.0, ans=0.125 2023-06-28 18:08:16,962 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=2114706.0, ans=0.125 2023-06-28 18:08:19,036 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=2114706.0, ans=0.0 2023-06-28 18:08:33,612 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=2114706.0, ans=0.0 2023-06-28 18:08:39,919 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=2114706.0, ans=0.0 2023-06-28 18:08:45,136 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2114766.0, ans=0.1 2023-06-28 18:08:58,735 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=2114826.0, ans=0.125 2023-06-28 18:09:00,267 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=2114826.0, ans=0.125 2023-06-28 18:09:12,073 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2114826.0, ans=0.1 2023-06-28 18:09:17,698 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.56 vs. limit=15.0 2023-06-28 18:09:29,801 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.045e+02 1.097e+03 1.381e+03 1.822e+03 3.953e+03, threshold=2.762e+03, percent-clipped=12.0 2023-06-28 18:09:42,673 INFO [train.py:996] (3/4) Epoch 12, batch 17050, loss[loss=0.246, simple_loss=0.3641, pruned_loss=0.06397, over 20894.00 frames. ], tot_loss[loss=0.2199, simple_loss=0.2986, pruned_loss=0.07063, over 4292262.23 frames. ], batch size: 607, lr: 2.41e-03, grad_scale: 8.0 2023-06-28 18:11:18,377 INFO [train.py:996] (3/4) Epoch 12, batch 17100, loss[loss=0.1916, simple_loss=0.2653, pruned_loss=0.05898, over 21699.00 frames. ], tot_loss[loss=0.2192, simple_loss=0.2969, pruned_loss=0.07076, over 4298128.58 frames. ], batch size: 230, lr: 2.41e-03, grad_scale: 8.0 2023-06-28 18:11:22,323 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=2115246.0, ans=0.125 2023-06-28 18:11:24,341 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.30 vs. limit=15.0 2023-06-28 18:12:52,867 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.623e+02 7.702e+02 1.047e+03 1.626e+03 3.499e+03, threshold=2.095e+03, percent-clipped=2.0 2023-06-28 18:12:58,684 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=2115486.0, ans=10.0 2023-06-28 18:13:01,295 INFO [train.py:996] (3/4) Epoch 12, batch 17150, loss[loss=0.1922, simple_loss=0.2761, pruned_loss=0.05414, over 21381.00 frames. ], tot_loss[loss=0.2165, simple_loss=0.2925, pruned_loss=0.07028, over 4297246.32 frames. 
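Editorial note: the optim.py entries above report five grad-norm quantiles (min, 25%, 50%, 75%, max), a clipping threshold, and the percentage of clipped steps, e.g. "quartiles 5.045e+02 1.097e+03 1.381e+03 1.822e+03 3.953e+03, threshold=2.762e+03, percent-clipped=12.0". The sketch below only illustrates computing such statistics over a window of recorded per-step gradient norms; how the recipe itself derives the threshold is not shown in this excerpt.

# Illustrative only: quantile statistics over a window of gradient norms.
import torch

def grad_norm_stats(grad_norms: torch.Tensor, threshold: float):
    qs = torch.quantile(grad_norms, torch.tensor([0.0, 0.25, 0.5, 0.75, 1.0]))
    percent_clipped = 100.0 * (grad_norms > threshold).float().mean()
    return qs, percent_clipped

norms = 500 + 1500 * torch.rand(200)           # hypothetical window of norms
qs, pct = grad_norm_stats(norms, threshold=2762.0)
print("quartiles:", [f"{v:.3e}" for v in qs.tolist()], f"percent-clipped={pct:.1f}")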
], batch size: 548, lr: 2.41e-03, grad_scale: 8.0 2023-06-28 18:13:19,294 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=9.59 vs. limit=15.0 2023-06-28 18:14:13,120 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.max_abs, batch_count=2115726.0, ans=10.0 2023-06-28 18:14:44,906 INFO [train.py:996] (3/4) Epoch 12, batch 17200, loss[loss=0.2227, simple_loss=0.2957, pruned_loss=0.07489, over 21434.00 frames. ], tot_loss[loss=0.2154, simple_loss=0.2923, pruned_loss=0.0693, over 4293020.45 frames. ], batch size: 211, lr: 2.41e-03, grad_scale: 16.0 2023-06-28 18:14:55,825 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=2115846.0, ans=0.125 2023-06-28 18:15:10,386 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=2115906.0, ans=0.125 2023-06-28 18:15:53,674 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=2116026.0, ans=0.125 2023-06-28 18:16:07,134 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=2116026.0, ans=0.125 2023-06-28 18:16:07,148 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=2116026.0, ans=0.125 2023-06-28 18:16:15,511 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2116086.0, ans=0.1 2023-06-28 18:16:20,218 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.974e+02 7.324e+02 9.389e+02 1.283e+03 2.769e+03, threshold=1.878e+03, percent-clipped=7.0 2023-06-28 18:16:33,063 INFO [train.py:996] (3/4) Epoch 12, batch 17250, loss[loss=0.2108, simple_loss=0.2793, pruned_loss=0.07119, over 19926.00 frames. ], tot_loss[loss=0.2186, simple_loss=0.2959, pruned_loss=0.07064, over 4284366.17 frames. ], batch size: 702, lr: 2.41e-03, grad_scale: 16.0 2023-06-28 18:17:10,377 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=2116266.0, ans=0.0 2023-06-28 18:17:38,092 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=2116326.0, ans=0.0 2023-06-28 18:18:15,680 INFO [train.py:996] (3/4) Epoch 12, batch 17300, loss[loss=0.2644, simple_loss=0.3468, pruned_loss=0.09096, over 21478.00 frames. ], tot_loss[loss=0.2258, simple_loss=0.304, pruned_loss=0.07376, over 4283065.54 frames. 
], batch size: 131, lr: 2.41e-03, grad_scale: 16.0 2023-06-28 18:19:16,679 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=2116626.0, ans=0.0 2023-06-28 18:19:38,584 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=2116686.0, ans=0.0 2023-06-28 18:19:48,139 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.078e+02 8.589e+02 1.215e+03 1.645e+03 3.725e+03, threshold=2.430e+03, percent-clipped=16.0 2023-06-28 18:19:56,939 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2116686.0, ans=0.1 2023-06-28 18:19:59,800 INFO [train.py:996] (3/4) Epoch 12, batch 17350, loss[loss=0.2706, simple_loss=0.3542, pruned_loss=0.09351, over 21486.00 frames. ], tot_loss[loss=0.2261, simple_loss=0.3048, pruned_loss=0.07372, over 4285557.11 frames. ], batch size: 508, lr: 2.41e-03, grad_scale: 8.0 2023-06-28 18:20:19,388 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=2116806.0, ans=0.125 2023-06-28 18:20:29,040 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=2116806.0, ans=0.125 2023-06-28 18:20:40,742 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=2116866.0, ans=0.125 2023-06-28 18:20:49,389 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=2116866.0, ans=0.125 2023-06-28 18:21:42,603 INFO [train.py:996] (3/4) Epoch 12, batch 17400, loss[loss=0.202, simple_loss=0.2861, pruned_loss=0.05897, over 21739.00 frames. ], tot_loss[loss=0.2211, simple_loss=0.3011, pruned_loss=0.07052, over 4269277.35 frames. ], batch size: 298, lr: 2.41e-03, grad_scale: 8.0 2023-06-28 18:21:50,362 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2117046.0, ans=0.1 2023-06-28 18:22:09,869 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=2117106.0, ans=0.125 2023-06-28 18:22:14,559 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2117106.0, ans=0.1 2023-06-28 18:23:13,925 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.979e+02 8.447e+02 1.378e+03 1.932e+03 4.918e+03, threshold=2.756e+03, percent-clipped=14.0 2023-06-28 18:23:19,671 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.min_positive, batch_count=2117346.0, ans=0.05 2023-06-28 18:23:20,616 INFO [train.py:996] (3/4) Epoch 12, batch 17450, loss[loss=0.1913, simple_loss=0.2381, pruned_loss=0.07227, over 19941.00 frames. ], tot_loss[loss=0.2174, simple_loss=0.2977, pruned_loss=0.06853, over 4272951.65 frames. 
], batch size: 704, lr: 2.41e-03, grad_scale: 8.0 2023-06-28 18:23:30,674 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=2117346.0, ans=0.025 2023-06-28 18:24:10,379 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=2117466.0, ans=0.125 2023-06-28 18:24:27,882 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2117526.0, ans=0.1 2023-06-28 18:24:57,148 INFO [train.py:996] (3/4) Epoch 12, batch 17500, loss[loss=0.1933, simple_loss=0.2674, pruned_loss=0.05957, over 21397.00 frames. ], tot_loss[loss=0.2145, simple_loss=0.2947, pruned_loss=0.06712, over 4281961.91 frames. ], batch size: 176, lr: 2.41e-03, grad_scale: 8.0 2023-06-28 18:25:25,436 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=2117706.0, ans=0.125 2023-06-28 18:25:34,264 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.72 vs. limit=15.0 2023-06-28 18:26:30,510 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.515e+02 7.082e+02 9.304e+02 1.343e+03 2.877e+03, threshold=1.861e+03, percent-clipped=1.0 2023-06-28 18:26:36,941 INFO [train.py:996] (3/4) Epoch 12, batch 17550, loss[loss=0.2264, simple_loss=0.3084, pruned_loss=0.07222, over 16417.00 frames. ], tot_loss[loss=0.2134, simple_loss=0.2947, pruned_loss=0.06603, over 4276058.70 frames. ], batch size: 65, lr: 2.41e-03, grad_scale: 8.0 2023-06-28 18:26:51,225 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.42 vs. limit=15.0 2023-06-28 18:26:58,930 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=16.66 vs. limit=15.0 2023-06-28 18:27:06,615 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=2118006.0, ans=0.125 2023-06-28 18:27:13,103 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=2118066.0, ans=0.125 2023-06-28 18:27:35,674 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=2118066.0, ans=0.125 2023-06-28 18:28:03,592 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2118186.0, ans=0.1 2023-06-28 18:28:14,130 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=2118186.0, ans=0.125 2023-06-28 18:28:18,262 INFO [train.py:996] (3/4) Epoch 12, batch 17600, loss[loss=0.237, simple_loss=0.316, pruned_loss=0.07899, over 21380.00 frames. ], tot_loss[loss=0.215, simple_loss=0.297, pruned_loss=0.06647, over 4267567.59 frames. 
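Editorial note: the many ScheduledFloat entries above log a named hyperparameter (skip rates, balancer probabilities, dropout) together with the current batch_count and its current value ("ans="). The sketch below is a hypothetical batch-count-indexed piecewise-linear schedule of the same flavour; the actual ScheduledFloat in scaling.py may behave differently.

# Hypothetical sketch of a batch-count-indexed float schedule (illustrative).
from bisect import bisect_right

class PiecewiseLinearSchedule:
    def __init__(self, points):
        # points: list of (batch_count, value), sorted by batch_count
        self.xs, self.ys = zip(*points)

    def __call__(self, batch_count: float) -> float:
        if batch_count <= self.xs[0]:
            return self.ys[0]
        if batch_count >= self.xs[-1]:
            return self.ys[-1]
        i = bisect_right(self.xs, batch_count)
        x0, x1 = self.xs[i - 1], self.xs[i]
        y0, y1 = self.ys[i - 1], self.ys[i]
        return y0 + (y1 - y0) * (batch_count - x0) / (x1 - x0)

conv_skip_rate = PiecewiseLinearSchedule([(0.0, 0.2), (500000.0, 0.05), (2000000.0, 0.0)])
print(conv_skip_rate(2112306.0))  # -> 0.0, the kind of "ans=0.0" value logged above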
], batch size: 159, lr: 2.41e-03, grad_scale: 16.0 2023-06-28 18:28:42,678 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=2118306.0, ans=0.125 2023-06-28 18:29:02,630 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=2118366.0, ans=0.125 2023-06-28 18:29:10,737 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=2118366.0, ans=0.125 2023-06-28 18:29:51,278 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.265e+02 8.846e+02 1.006e+03 1.368e+03 3.785e+03, threshold=2.012e+03, percent-clipped=6.0 2023-06-28 18:30:03,184 INFO [train.py:996] (3/4) Epoch 12, batch 17650, loss[loss=0.1973, simple_loss=0.2815, pruned_loss=0.0565, over 21699.00 frames. ], tot_loss[loss=0.2135, simple_loss=0.2953, pruned_loss=0.06581, over 4258360.57 frames. ], batch size: 415, lr: 2.41e-03, grad_scale: 16.0 2023-06-28 18:30:19,518 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=2118606.0, ans=0.2 2023-06-28 18:31:03,382 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=2118666.0, ans=0.0 2023-06-28 18:31:27,106 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=2118786.0, ans=0.125 2023-06-28 18:31:29,245 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.79 vs. limit=22.5 2023-06-28 18:31:46,617 INFO [train.py:996] (3/4) Epoch 12, batch 17700, loss[loss=0.2264, simple_loss=0.3176, pruned_loss=0.06756, over 21472.00 frames. ], tot_loss[loss=0.2078, simple_loss=0.2885, pruned_loss=0.06353, over 4252896.02 frames. ], batch size: 194, lr: 2.41e-03, grad_scale: 16.0 2023-06-28 18:32:21,086 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.42 vs. limit=15.0 2023-06-28 18:32:27,654 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=2118906.0, ans=0.125 2023-06-28 18:32:32,916 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2118966.0, ans=0.1 2023-06-28 18:32:39,511 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=2118966.0, ans=0.0 2023-06-28 18:33:05,518 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.72 vs. limit=22.5 2023-06-28 18:33:09,894 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=2119086.0, ans=0.125 2023-06-28 18:33:19,191 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.445e+02 8.687e+02 1.297e+03 2.273e+03 4.187e+03, threshold=2.595e+03, percent-clipped=29.0 2023-06-28 18:33:26,138 INFO [train.py:996] (3/4) Epoch 12, batch 17750, loss[loss=0.2527, simple_loss=0.3306, pruned_loss=0.08739, over 21718.00 frames. ], tot_loss[loss=0.2137, simple_loss=0.2948, pruned_loss=0.0663, over 4250231.12 frames. 
], batch size: 441, lr: 2.41e-03, grad_scale: 16.0 2023-06-28 18:34:01,991 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=15.51 vs. limit=22.5 2023-06-28 18:34:33,734 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=2119266.0, ans=0.0 2023-06-28 18:34:41,014 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=2119326.0, ans=0.125 2023-06-28 18:35:11,416 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=2119386.0, ans=0.035 2023-06-28 18:35:20,464 INFO [train.py:996] (3/4) Epoch 12, batch 17800, loss[loss=0.217, simple_loss=0.3099, pruned_loss=0.06203, over 21627.00 frames. ], tot_loss[loss=0.2154, simple_loss=0.2971, pruned_loss=0.06686, over 4258422.15 frames. ], batch size: 389, lr: 2.41e-03, grad_scale: 16.0 2023-06-28 18:35:54,721 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=2119566.0, ans=0.0 2023-06-28 18:36:16,634 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=2119626.0, ans=0.2 2023-06-28 18:36:29,870 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=2119626.0, ans=0.125 2023-06-28 18:36:41,920 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.01 vs. limit=22.5 2023-06-28 18:36:52,581 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.029e+02 8.129e+02 1.136e+03 1.993e+03 4.758e+03, threshold=2.272e+03, percent-clipped=17.0 2023-06-28 18:36:59,634 INFO [train.py:996] (3/4) Epoch 12, batch 17850, loss[loss=0.2266, simple_loss=0.2995, pruned_loss=0.07679, over 21329.00 frames. ], tot_loss[loss=0.2165, simple_loss=0.2982, pruned_loss=0.06736, over 4259679.51 frames. ], batch size: 159, lr: 2.41e-03, grad_scale: 16.0 2023-06-28 18:37:05,696 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=2119746.0, ans=0.125 2023-06-28 18:37:39,927 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=2119866.0, ans=0.0 2023-06-28 18:37:55,091 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2119926.0, ans=0.1 2023-06-28 18:38:40,382 INFO [train.py:996] (3/4) Epoch 12, batch 17900, loss[loss=0.2088, simple_loss=0.3084, pruned_loss=0.05455, over 21633.00 frames. ], tot_loss[loss=0.2194, simple_loss=0.3022, pruned_loss=0.0683, over 4262706.95 frames. ], batch size: 263, lr: 2.41e-03, grad_scale: 16.0 2023-06-28 18:39:34,056 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.82 vs. 
limit=6.0 2023-06-28 18:39:59,519 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=2120226.0, ans=0.0 2023-06-28 18:40:12,405 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.160e+02 9.224e+02 1.391e+03 2.083e+03 4.254e+03, threshold=2.783e+03, percent-clipped=21.0 2023-06-28 18:40:19,125 INFO [train.py:996] (3/4) Epoch 12, batch 17950, loss[loss=0.1674, simple_loss=0.2532, pruned_loss=0.04081, over 21796.00 frames. ], tot_loss[loss=0.2152, simple_loss=0.2999, pruned_loss=0.06525, over 4251348.85 frames. ], batch size: 118, lr: 2.41e-03, grad_scale: 16.0 2023-06-28 18:41:31,263 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.73 vs. limit=15.0 2023-06-28 18:41:42,086 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=2120586.0, ans=0.125 2023-06-28 18:41:56,544 INFO [train.py:996] (3/4) Epoch 12, batch 18000, loss[loss=0.1738, simple_loss=0.2481, pruned_loss=0.04974, over 21619.00 frames. ], tot_loss[loss=0.2103, simple_loss=0.2928, pruned_loss=0.06392, over 4252125.87 frames. ], batch size: 247, lr: 2.41e-03, grad_scale: 32.0 2023-06-28 18:41:56,545 INFO [train.py:1019] (3/4) Computing validation loss 2023-06-28 18:42:16,411 INFO [train.py:1028] (3/4) Epoch 12, validation: loss=0.2604, simple_loss=0.3527, pruned_loss=0.08401, over 1796401.00 frames. 2023-06-28 18:42:16,412 INFO [train.py:1029] (3/4) Maximum memory allocated so far is 23690MB 2023-06-28 18:43:36,228 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.00 vs. limit=15.0 2023-06-28 18:43:42,573 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=2120886.0, ans=0.0 2023-06-28 18:43:55,005 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.231e+02 7.241e+02 9.176e+02 1.211e+03 3.223e+03, threshold=1.835e+03, percent-clipped=1.0 2023-06-28 18:43:59,755 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten.whitening_limit, batch_count=2120946.0, ans=15.0 2023-06-28 18:44:00,007 INFO [train.py:996] (3/4) Epoch 12, batch 18050, loss[loss=0.2079, simple_loss=0.2659, pruned_loss=0.07496, over 15035.00 frames. ], tot_loss[loss=0.2064, simple_loss=0.2871, pruned_loss=0.0629, over 4248167.84 frames. ], batch size: 61, lr: 2.41e-03, grad_scale: 16.0 2023-06-28 18:44:24,362 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=2121006.0, ans=0.0 2023-06-28 18:44:34,845 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=8.35 vs. 
limit=15.0 2023-06-28 18:44:41,129 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=2121006.0, ans=0.125 2023-06-28 18:45:12,089 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=2121126.0, ans=0.04949747468305833 2023-06-28 18:45:41,532 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=2121186.0, ans=0.0 2023-06-28 18:45:44,358 INFO [train.py:996] (3/4) Epoch 12, batch 18100, loss[loss=0.2367, simple_loss=0.3243, pruned_loss=0.07455, over 21217.00 frames. ], tot_loss[loss=0.212, simple_loss=0.2923, pruned_loss=0.06581, over 4252155.06 frames. ], batch size: 143, lr: 2.41e-03, grad_scale: 16.0 2023-06-28 18:45:45,012 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-28 18:47:05,684 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-28 18:47:16,726 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=2121486.0, ans=0.125 2023-06-28 18:47:16,874 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=2121486.0, ans=0.0 2023-06-28 18:47:23,004 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.434e+02 8.761e+02 1.193e+03 1.712e+03 3.705e+03, threshold=2.386e+03, percent-clipped=21.0 2023-06-28 18:47:26,564 INFO [train.py:996] (3/4) Epoch 12, batch 18150, loss[loss=0.19, simple_loss=0.2665, pruned_loss=0.05668, over 21818.00 frames. ], tot_loss[loss=0.2122, simple_loss=0.2939, pruned_loss=0.06523, over 4264037.70 frames. ], batch size: 317, lr: 2.41e-03, grad_scale: 8.0 2023-06-28 18:48:07,087 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=2121606.0, ans=0.0 2023-06-28 18:48:44,776 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=2121726.0, ans=0.2 2023-06-28 18:48:45,392 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.03 vs. limit=6.0 2023-06-28 18:49:01,534 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=2121786.0, ans=0.125 2023-06-28 18:49:08,821 INFO [train.py:996] (3/4) Epoch 12, batch 18200, loss[loss=0.2027, simple_loss=0.2624, pruned_loss=0.0715, over 21498.00 frames. ], tot_loss[loss=0.2092, simple_loss=0.2883, pruned_loss=0.06507, over 4247909.83 frames. ], batch size: 391, lr: 2.41e-03, grad_scale: 8.0 2023-06-28 18:49:45,648 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=2121906.0, ans=0.125 2023-06-28 18:49:47,145 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2121906.0, ans=0.1 2023-06-28 18:49:55,311 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=2121966.0, ans=0.125 2023-06-28 18:50:05,632 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.04 vs. 
limit=15.0 2023-06-28 18:50:10,086 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=2122026.0, ans=0.125 2023-06-28 18:50:15,239 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.40 vs. limit=15.0 2023-06-28 18:50:41,808 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=2122086.0, ans=0.2 2023-06-28 18:50:43,317 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=2122086.0, ans=0.125 2023-06-28 18:50:44,441 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.871e+02 6.470e+02 8.191e+02 1.481e+03 3.644e+03, threshold=1.638e+03, percent-clipped=8.0 2023-06-28 18:50:48,138 INFO [train.py:996] (3/4) Epoch 12, batch 18250, loss[loss=0.2036, simple_loss=0.2736, pruned_loss=0.06678, over 21928.00 frames. ], tot_loss[loss=0.2041, simple_loss=0.2816, pruned_loss=0.06327, over 4252969.23 frames. ], batch size: 333, lr: 2.41e-03, grad_scale: 8.0 2023-06-28 18:51:02,144 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.79 vs. limit=15.0 2023-06-28 18:51:29,078 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2122266.0, ans=0.1 2023-06-28 18:51:33,785 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=2122266.0, ans=0.125 2023-06-28 18:51:39,109 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=2122266.0, ans=0.0 2023-06-28 18:52:25,410 INFO [train.py:996] (3/4) Epoch 12, batch 18300, loss[loss=0.2306, simple_loss=0.3521, pruned_loss=0.05459, over 20847.00 frames. ], tot_loss[loss=0.2049, simple_loss=0.2829, pruned_loss=0.06343, over 4264822.69 frames. ], batch size: 607, lr: 2.41e-03, grad_scale: 8.0 2023-06-28 18:52:51,858 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=2122506.0, ans=0.2 2023-06-28 18:54:03,759 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.548e+02 1.033e+03 1.487e+03 2.193e+03 4.357e+03, threshold=2.975e+03, percent-clipped=43.0 2023-06-28 18:54:06,760 INFO [train.py:996] (3/4) Epoch 12, batch 18350, loss[loss=0.2265, simple_loss=0.3043, pruned_loss=0.07434, over 21611.00 frames. ], tot_loss[loss=0.2069, simple_loss=0.287, pruned_loss=0.06339, over 4261037.97 frames. ], batch size: 414, lr: 2.41e-03, grad_scale: 8.0 2023-06-28 18:54:31,208 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=2122806.0, ans=0.0 2023-06-28 18:55:49,916 INFO [train.py:996] (3/4) Epoch 12, batch 18400, loss[loss=0.2161, simple_loss=0.3124, pruned_loss=0.05992, over 21304.00 frames. ], tot_loss[loss=0.2031, simple_loss=0.2825, pruned_loss=0.06182, over 4259658.67 frames. 
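Editorial note: at batch 18000 above the log interleaves a periodic validation pass (validation: loss=0.2604, simple_loss=0.3527, pruned_loss=0.08401, over 1796401.00 frames) with a report of the maximum CUDA memory allocated so far (23690MB). The sketch below mirrors that pattern: evaluate the dev dataloader under no_grad, average the loss, then query torch.cuda.max_memory_allocated. compute_loss and valid_dl are placeholders, not the recipe's actual objects.

# Sketch of a periodic validation pass plus memory report (illustrative only).
import logging
import torch

def run_validation(model, valid_dl, compute_loss, device) -> float:
    model.eval()
    tot_loss, tot_frames = 0.0, 0.0
    with torch.no_grad():
        for batch in valid_dl:
            loss, num_frames = compute_loss(model, batch)
            tot_loss += loss.item() * num_frames
            tot_frames += num_frames
    model.train()
    logging.info(f"validation: loss={tot_loss / tot_frames:.4f}, over {tot_frames:.2f} frames.")
    logging.info(
        "Maximum memory allocated so far is "
        f"{torch.cuda.max_memory_allocated(device) // (1024 * 1024)}MB"
    )
    return tot_loss / tot_frames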
], batch size: 551, lr: 2.41e-03, grad_scale: 16.0 2023-06-28 18:56:41,465 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=2123166.0, ans=0.125 2023-06-28 18:56:56,064 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=2123226.0, ans=0.0 2023-06-28 18:56:57,913 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=2123226.0, ans=0.125 2023-06-28 18:57:15,546 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2123286.0, ans=0.1 2023-06-28 18:57:22,742 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.644e+02 6.567e+02 9.671e+02 1.816e+03 3.680e+03, threshold=1.934e+03, percent-clipped=2.0 2023-06-28 18:57:26,091 INFO [train.py:996] (3/4) Epoch 12, batch 18450, loss[loss=0.1804, simple_loss=0.2605, pruned_loss=0.05014, over 21287.00 frames. ], tot_loss[loss=0.1995, simple_loss=0.2803, pruned_loss=0.05935, over 4262205.41 frames. ], batch size: 551, lr: 2.41e-03, grad_scale: 16.0 2023-06-28 18:57:39,519 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2123346.0, ans=0.1 2023-06-28 18:58:13,802 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=2123466.0, ans=0.125 2023-06-28 18:58:23,763 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=2123466.0, ans=0.2 2023-06-28 18:58:28,670 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=2123526.0, ans=0.07 2023-06-28 18:59:04,959 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.93 vs. limit=15.0 2023-06-28 18:59:07,201 INFO [train.py:996] (3/4) Epoch 12, batch 18500, loss[loss=0.1749, simple_loss=0.2459, pruned_loss=0.05197, over 21298.00 frames. ], tot_loss[loss=0.1957, simple_loss=0.2751, pruned_loss=0.05813, over 4264702.10 frames. ], batch size: 144, lr: 2.41e-03, grad_scale: 16.0 2023-06-28 18:59:07,773 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-28 18:59:11,530 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.26 vs. limit=15.0 2023-06-28 19:00:45,281 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.695e+02 8.087e+02 1.310e+03 2.007e+03 4.820e+03, threshold=2.620e+03, percent-clipped=25.0 2023-06-28 19:00:48,732 INFO [train.py:996] (3/4) Epoch 12, batch 18550, loss[loss=0.1943, simple_loss=0.2614, pruned_loss=0.06358, over 21873.00 frames. ], tot_loss[loss=0.1942, simple_loss=0.2738, pruned_loss=0.05731, over 4247956.45 frames. 
], batch size: 107, lr: 2.41e-03, grad_scale: 16.0 2023-06-28 19:01:42,917 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=2124066.0, ans=0.0 2023-06-28 19:02:17,821 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2124186.0, ans=0.1 2023-06-28 19:02:32,387 INFO [train.py:996] (3/4) Epoch 12, batch 18600, loss[loss=0.2367, simple_loss=0.3219, pruned_loss=0.07575, over 21877.00 frames. ], tot_loss[loss=0.1943, simple_loss=0.2722, pruned_loss=0.05821, over 4239392.12 frames. ], batch size: 373, lr: 2.41e-03, grad_scale: 16.0 2023-06-28 19:02:44,523 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=2124246.0, ans=0.125 2023-06-28 19:03:13,878 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=2124306.0, ans=0.025 2023-06-28 19:03:48,178 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=2124426.0, ans=0.125 2023-06-28 19:03:52,390 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=5.58 vs. limit=10.0 2023-06-28 19:04:01,308 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=2124486.0, ans=0.125 2023-06-28 19:04:12,043 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.057e+02 7.816e+02 1.103e+03 1.650e+03 3.069e+03, threshold=2.205e+03, percent-clipped=1.0 2023-06-28 19:04:13,753 INFO [train.py:996] (3/4) Epoch 12, batch 18650, loss[loss=0.1844, simple_loss=0.2594, pruned_loss=0.0547, over 21526.00 frames. ], tot_loss[loss=0.1943, simple_loss=0.2719, pruned_loss=0.05836, over 4248005.35 frames. ], batch size: 195, lr: 2.41e-03, grad_scale: 8.0 2023-06-28 19:05:16,333 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=2124726.0, ans=0.125 2023-06-28 19:05:43,182 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=11.22 vs. limit=15.0 2023-06-28 19:05:55,209 INFO [train.py:996] (3/4) Epoch 12, batch 18700, loss[loss=0.1917, simple_loss=0.2699, pruned_loss=0.05673, over 21986.00 frames. ], tot_loss[loss=0.1946, simple_loss=0.27, pruned_loss=0.05964, over 4253904.25 frames. ], batch size: 113, lr: 2.41e-03, grad_scale: 8.0 2023-06-28 19:07:09,701 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2125026.0, ans=0.1 2023-06-28 19:07:32,994 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=2125086.0, ans=0.0 2023-06-28 19:07:35,698 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.958e+02 6.838e+02 8.648e+02 1.289e+03 2.694e+03, threshold=1.730e+03, percent-clipped=5.0 2023-06-28 19:07:37,312 INFO [train.py:996] (3/4) Epoch 12, batch 18750, loss[loss=0.2021, simple_loss=0.2696, pruned_loss=0.06733, over 21313.00 frames. ], tot_loss[loss=0.1975, simple_loss=0.2717, pruned_loss=0.06166, over 4250752.84 frames. 
], batch size: 144, lr: 2.41e-03, grad_scale: 8.0 2023-06-28 19:08:53,211 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=2125326.0, ans=0.2 2023-06-28 19:08:55,540 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.97 vs. limit=6.0 2023-06-28 19:09:10,144 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=2125386.0, ans=0.05 2023-06-28 19:09:19,259 INFO [train.py:996] (3/4) Epoch 12, batch 18800, loss[loss=0.1945, simple_loss=0.2891, pruned_loss=0.04991, over 21747.00 frames. ], tot_loss[loss=0.2012, simple_loss=0.2768, pruned_loss=0.06282, over 4250373.83 frames. ], batch size: 351, lr: 2.41e-03, grad_scale: 16.0 2023-06-28 19:10:22,231 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2125626.0, ans=0.1 2023-06-28 19:10:32,003 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=2125626.0, ans=0.05 2023-06-28 19:10:42,352 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.00 vs. limit=15.0 2023-06-28 19:10:58,897 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.378e+02 7.621e+02 1.255e+03 1.956e+03 3.877e+03, threshold=2.510e+03, percent-clipped=29.0 2023-06-28 19:11:00,574 INFO [train.py:996] (3/4) Epoch 12, batch 18850, loss[loss=0.1646, simple_loss=0.2629, pruned_loss=0.03318, over 21803.00 frames. ], tot_loss[loss=0.1971, simple_loss=0.2759, pruned_loss=0.05916, over 4254462.99 frames. ], batch size: 282, lr: 2.41e-03, grad_scale: 16.0 2023-06-28 19:11:30,384 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.41 vs. limit=15.0 2023-06-28 19:11:52,421 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=2125866.0, ans=0.125 2023-06-28 19:12:40,425 INFO [train.py:996] (3/4) Epoch 12, batch 18900, loss[loss=0.2072, simple_loss=0.2735, pruned_loss=0.07042, over 21819.00 frames. ], tot_loss[loss=0.1948, simple_loss=0.2719, pruned_loss=0.05886, over 4241995.49 frames. ], batch size: 351, lr: 2.41e-03, grad_scale: 16.0 2023-06-28 19:13:11,477 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=2126106.0, ans=0.0 2023-06-28 19:13:16,806 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.31 vs. limit=10.0 2023-06-28 19:13:36,026 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=2126166.0, ans=0.2 2023-06-28 19:13:44,733 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.10 vs. limit=15.0 2023-06-28 19:13:46,536 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.69 vs. 
limit=15.0 2023-06-28 19:14:04,018 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=2126286.0, ans=0.0 2023-06-28 19:14:05,673 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=2126286.0, ans=0.07 2023-06-28 19:14:14,967 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.285e+02 7.566e+02 1.259e+03 1.840e+03 2.966e+03, threshold=2.518e+03, percent-clipped=3.0 2023-06-28 19:14:16,577 INFO [train.py:996] (3/4) Epoch 12, batch 18950, loss[loss=0.2566, simple_loss=0.3146, pruned_loss=0.09925, over 21725.00 frames. ], tot_loss[loss=0.1974, simple_loss=0.273, pruned_loss=0.06096, over 4247890.68 frames. ], batch size: 508, lr: 2.41e-03, grad_scale: 16.0 2023-06-28 19:14:52,928 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.91 vs. limit=22.5 2023-06-28 19:15:09,982 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=6.22 vs. limit=15.0 2023-06-28 19:15:34,077 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2126526.0, ans=0.1 2023-06-28 19:15:43,959 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=2126586.0, ans=0.2 2023-06-28 19:15:47,455 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=2126586.0, ans=0.2 2023-06-28 19:15:55,092 INFO [train.py:996] (3/4) Epoch 12, batch 19000, loss[loss=0.2274, simple_loss=0.3078, pruned_loss=0.07348, over 21483.00 frames. ], tot_loss[loss=0.2037, simple_loss=0.2822, pruned_loss=0.06265, over 4255613.62 frames. ], batch size: 131, lr: 2.41e-03, grad_scale: 16.0 2023-06-28 19:16:40,805 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=2126766.0, ans=0.2 2023-06-28 19:16:44,026 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=2126766.0, ans=0.125 2023-06-28 19:17:02,408 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=2126826.0, ans=0.125 2023-06-28 19:17:14,166 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=2126826.0, ans=0.125 2023-06-28 19:17:21,056 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=2126886.0, ans=0.125 2023-06-28 19:17:32,193 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.360e+02 7.287e+02 9.721e+02 1.319e+03 3.703e+03, threshold=1.944e+03, percent-clipped=9.0 2023-06-28 19:17:33,804 INFO [train.py:996] (3/4) Epoch 12, batch 19050, loss[loss=0.2141, simple_loss=0.2747, pruned_loss=0.07677, over 20042.00 frames. ], tot_loss[loss=0.2086, simple_loss=0.2861, pruned_loss=0.06548, over 4262299.35 frames. 
], batch size: 703, lr: 2.41e-03, grad_scale: 16.0 2023-06-28 19:17:36,327 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=2126946.0, ans=0.0 2023-06-28 19:17:54,110 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2127006.0, ans=0.1 2023-06-28 19:17:58,756 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=2127006.0, ans=0.2 2023-06-28 19:18:25,309 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=2127066.0, ans=0.0 2023-06-28 19:19:16,205 INFO [train.py:996] (3/4) Epoch 12, batch 19100, loss[loss=0.1873, simple_loss=0.2555, pruned_loss=0.05954, over 21251.00 frames. ], tot_loss[loss=0.2083, simple_loss=0.2844, pruned_loss=0.06606, over 4268249.20 frames. ], batch size: 176, lr: 2.41e-03, grad_scale: 16.0 2023-06-28 19:19:47,696 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.47 vs. limit=12.0 2023-06-28 19:20:12,106 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=2127366.0, ans=0.05 2023-06-28 19:20:25,927 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=2127426.0, ans=0.0 2023-06-28 19:21:01,435 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.479e+02 7.973e+02 1.169e+03 1.755e+03 3.524e+03, threshold=2.338e+03, percent-clipped=19.0 2023-06-28 19:21:03,172 INFO [train.py:996] (3/4) Epoch 12, batch 19150, loss[loss=0.2342, simple_loss=0.3321, pruned_loss=0.06813, over 21684.00 frames. ], tot_loss[loss=0.2105, simple_loss=0.287, pruned_loss=0.06694, over 4267153.85 frames. ], batch size: 298, lr: 2.41e-03, grad_scale: 16.0 2023-06-28 19:21:38,966 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=2127606.0, ans=0.035 2023-06-28 19:21:51,465 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=2127666.0, ans=0.0 2023-06-28 19:21:51,476 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=2127666.0, ans=0.0 2023-06-28 19:22:08,868 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.51 vs. limit=15.0 2023-06-28 19:22:33,313 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=2127786.0, ans=0.125 2023-06-28 19:22:52,898 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=2127846.0, ans=0.125 2023-06-28 19:22:53,938 INFO [train.py:996] (3/4) Epoch 12, batch 19200, loss[loss=0.2703, simple_loss=0.388, pruned_loss=0.07634, over 20765.00 frames. ], tot_loss[loss=0.2174, simple_loss=0.2982, pruned_loss=0.06829, over 4264426.13 frames. 
], batch size: 607, lr: 2.41e-03, grad_scale: 32.0 2023-06-28 19:24:36,176 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.048e+02 8.519e+02 1.165e+03 1.659e+03 4.865e+03, threshold=2.330e+03, percent-clipped=13.0 2023-06-28 19:24:36,206 INFO [train.py:996] (3/4) Epoch 12, batch 19250, loss[loss=0.196, simple_loss=0.2698, pruned_loss=0.06109, over 21529.00 frames. ], tot_loss[loss=0.2131, simple_loss=0.2981, pruned_loss=0.06399, over 4267662.11 frames. ], batch size: 144, lr: 2.41e-03, grad_scale: 16.0 2023-06-28 19:24:36,767 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=2128146.0, ans=0.2 2023-06-28 19:24:39,118 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.27 vs. limit=15.0 2023-06-28 19:24:41,726 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=2128146.0, ans=0.0 2023-06-28 19:24:42,361 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=7.52 vs. limit=15.0 2023-06-28 19:24:48,460 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=2128146.0, ans=0.125 2023-06-28 19:24:56,452 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=2128206.0, ans=0.125 2023-06-28 19:25:16,249 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2128266.0, ans=0.1 2023-06-28 19:25:18,027 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2128266.0, ans=0.1 2023-06-28 19:25:28,284 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=2128266.0, ans=0.0 2023-06-28 19:25:59,600 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=2128386.0, ans=0.125 2023-06-28 19:26:18,597 INFO [train.py:996] (3/4) Epoch 12, batch 19300, loss[loss=0.159, simple_loss=0.2562, pruned_loss=0.0309, over 21581.00 frames. ], tot_loss[loss=0.2108, simple_loss=0.2949, pruned_loss=0.06337, over 4275773.25 frames. ], batch size: 230, lr: 2.40e-03, grad_scale: 16.0 2023-06-28 19:26:34,777 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.66 vs. limit=10.0 2023-06-28 19:27:21,346 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=2128626.0, ans=0.035 2023-06-28 19:27:57,287 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.967e+02 7.718e+02 1.195e+03 1.796e+03 4.248e+03, threshold=2.390e+03, percent-clipped=9.0 2023-06-28 19:27:57,333 INFO [train.py:996] (3/4) Epoch 12, batch 19350, loss[loss=0.2408, simple_loss=0.3259, pruned_loss=0.07784, over 21536.00 frames. ], tot_loss[loss=0.2057, simple_loss=0.2902, pruned_loss=0.06061, over 4276887.67 frames. ], batch size: 473, lr: 2.40e-03, grad_scale: 16.0 2023-06-28 19:28:42,630 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.06 vs. 
limit=22.5 2023-06-28 19:28:57,441 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.56 vs. limit=15.0 2023-06-28 19:29:14,450 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=2128986.0, ans=0.0 2023-06-28 19:29:37,798 INFO [train.py:996] (3/4) Epoch 12, batch 19400, loss[loss=0.1654, simple_loss=0.2481, pruned_loss=0.04134, over 21620.00 frames. ], tot_loss[loss=0.2032, simple_loss=0.287, pruned_loss=0.05966, over 4272787.10 frames. ], batch size: 230, lr: 2.40e-03, grad_scale: 16.0 2023-06-28 19:30:06,266 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=2129106.0, ans=0.125 2023-06-28 19:30:09,553 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2129106.0, ans=0.1 2023-06-28 19:30:56,106 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=2129286.0, ans=0.125 2023-06-28 19:31:19,980 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.788e+02 6.972e+02 8.917e+02 1.265e+03 3.232e+03, threshold=1.783e+03, percent-clipped=5.0 2023-06-28 19:31:20,010 INFO [train.py:996] (3/4) Epoch 12, batch 19450, loss[loss=0.1975, simple_loss=0.2587, pruned_loss=0.06816, over 21269.00 frames. ], tot_loss[loss=0.2022, simple_loss=0.2835, pruned_loss=0.06048, over 4280614.64 frames. ], batch size: 176, lr: 2.40e-03, grad_scale: 16.0 2023-06-28 19:31:20,652 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=2129346.0, ans=0.125 2023-06-28 19:31:36,703 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=2129346.0, ans=0.125 2023-06-28 19:31:42,400 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=7.76 vs. limit=15.0 2023-06-28 19:32:27,949 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=2129526.0, ans=0.125 2023-06-28 19:32:54,661 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2129586.0, ans=0.1 2023-06-28 19:33:02,589 INFO [train.py:996] (3/4) Epoch 12, batch 19500, loss[loss=0.2787, simple_loss=0.3457, pruned_loss=0.1059, over 21453.00 frames. ], tot_loss[loss=0.2024, simple_loss=0.2807, pruned_loss=0.06204, over 4285814.28 frames. ], batch size: 507, lr: 2.40e-03, grad_scale: 8.0 2023-06-28 19:33:17,192 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.89 vs. 
limit=22.5 2023-06-28 19:33:20,141 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=2129646.0, ans=0.0 2023-06-28 19:33:30,321 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=2129706.0, ans=0.125 2023-06-28 19:33:43,393 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=2129766.0, ans=0.125 2023-06-28 19:33:51,423 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=2129766.0, ans=0.2 2023-06-28 19:33:55,030 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.20 vs. limit=15.0 2023-06-28 19:34:30,420 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.79 vs. limit=6.0 2023-06-28 19:34:43,707 INFO [train.py:996] (3/4) Epoch 12, batch 19550, loss[loss=0.2043, simple_loss=0.3048, pruned_loss=0.05185, over 21850.00 frames. ], tot_loss[loss=0.1987, simple_loss=0.2765, pruned_loss=0.0604, over 4277423.38 frames. ], batch size: 371, lr: 2.40e-03, grad_scale: 8.0 2023-06-28 19:34:45,238 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.793e+02 7.099e+02 1.131e+03 1.724e+03 3.417e+03, threshold=2.262e+03, percent-clipped=22.0 2023-06-28 19:35:30,608 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.14 vs. limit=15.0 2023-06-28 19:36:13,272 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=2130186.0, ans=0.125 2023-06-28 19:36:25,918 INFO [train.py:996] (3/4) Epoch 12, batch 19600, loss[loss=0.203, simple_loss=0.2762, pruned_loss=0.06488, over 21095.00 frames. ], tot_loss[loss=0.2007, simple_loss=0.2788, pruned_loss=0.06126, over 4280673.49 frames. ], batch size: 608, lr: 2.40e-03, grad_scale: 16.0 2023-06-28 19:36:34,921 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=2130246.0, ans=0.0 2023-06-28 19:36:47,080 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=2130306.0, ans=0.0 2023-06-28 19:38:03,555 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2130486.0, ans=0.1 2023-06-28 19:38:14,411 INFO [train.py:996] (3/4) Epoch 12, batch 19650, loss[loss=0.2004, simple_loss=0.2807, pruned_loss=0.06005, over 21797.00 frames. ], tot_loss[loss=0.2055, simple_loss=0.2832, pruned_loss=0.06387, over 4274956.06 frames. 
], batch size: 332, lr: 2.40e-03, grad_scale: 16.0 2023-06-28 19:38:16,157 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.322e+02 7.698e+02 1.187e+03 1.875e+03 3.672e+03, threshold=2.374e+03, percent-clipped=11.0 2023-06-28 19:38:40,979 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=2130606.0, ans=0.0 2023-06-28 19:38:42,556 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-28 19:39:13,024 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=2130726.0, ans=0.125 2023-06-28 19:39:29,572 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-28 19:40:00,290 INFO [train.py:996] (3/4) Epoch 12, batch 19700, loss[loss=0.2044, simple_loss=0.3031, pruned_loss=0.05288, over 21765.00 frames. ], tot_loss[loss=0.2087, simple_loss=0.2864, pruned_loss=0.06553, over 4277015.32 frames. ], batch size: 371, lr: 2.40e-03, grad_scale: 16.0 2023-06-28 19:40:58,155 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=2130966.0, ans=0.07 2023-06-28 19:41:39,859 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.49 vs. limit=15.0 2023-06-28 19:41:50,325 INFO [train.py:996] (3/4) Epoch 12, batch 19750, loss[loss=0.2297, simple_loss=0.3354, pruned_loss=0.06194, over 21655.00 frames. ], tot_loss[loss=0.2122, simple_loss=0.2928, pruned_loss=0.0658, over 4269027.29 frames. ], batch size: 263, lr: 2.40e-03, grad_scale: 16.0 2023-06-28 19:41:51,905 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.980e+02 8.894e+02 1.243e+03 1.861e+03 5.840e+03, threshold=2.486e+03, percent-clipped=14.0 2023-06-28 19:42:07,871 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.41 vs. limit=10.0 2023-06-28 19:42:45,130 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=2131266.0, ans=0.125 2023-06-28 19:43:22,739 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=2131386.0, ans=0.125 2023-06-28 19:43:31,905 INFO [train.py:996] (3/4) Epoch 12, batch 19800, loss[loss=0.1726, simple_loss=0.2487, pruned_loss=0.04821, over 21443.00 frames. ], tot_loss[loss=0.2128, simple_loss=0.2925, pruned_loss=0.06656, over 4282511.62 frames. ], batch size: 211, lr: 2.40e-03, grad_scale: 16.0 2023-06-28 19:43:47,777 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=2131506.0, ans=0.0 2023-06-28 19:44:50,305 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=2131626.0, ans=0.125 2023-06-28 19:45:16,343 INFO [train.py:996] (3/4) Epoch 12, batch 19850, loss[loss=0.1392, simple_loss=0.2005, pruned_loss=0.03895, over 15930.00 frames. ], tot_loss[loss=0.205, simple_loss=0.286, pruned_loss=0.06201, over 4276536.15 frames. 
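In the optim.py lines, the reported threshold is consistently the Clipping_scale times the median of the grad-norm quartiles (for example 2.0 * 1.187e+03 ≈ 2.374e+03 just above), and percent-clipped is the share of recent norms above that threshold. The helper below reproduces only that reporting arithmetic as a hedged sketch; the optimizer's actual clipping logic is more involved and is not shown here.

import torch

def grad_norm_clipping_stats(recent_grad_norms, clipping_scale=2.0):
    """Summarise a window of recent gradient norms the way the log lines do:
    (min, 25%, 50%, 75%, max) quartiles, threshold = clipping_scale * median,
    and the percentage of norms in the window exceeding the threshold.
    Illustrative only; not the optimizer's real clipping code."""
    norms = torch.as_tensor(recent_grad_norms, dtype=torch.float32)
    quartiles = torch.quantile(norms, torch.tensor([0.0, 0.25, 0.5, 0.75, 1.0]))
    threshold = clipping_scale * quartiles[2].item()
    percent_clipped = 100.0 * (norms > threshold).float().mean().item()
    return quartiles.tolist(), threshold, percent_clipped

# Hypothetical window of norms, roughly matching the magnitudes logged above.
q, thr, pct = grad_norm_clipping_stats([532.2, 769.8, 1187.0, 1875.0, 3672.0])
print(q, thr, pct)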
], batch size: 60, lr: 2.40e-03, grad_scale: 16.0 2023-06-28 19:45:18,107 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.069e+02 7.581e+02 9.843e+02 1.508e+03 3.551e+03, threshold=1.969e+03, percent-clipped=6.0 2023-06-28 19:45:19,463 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.40 vs. limit=22.5 2023-06-28 19:45:20,576 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=2131746.0, ans=0.125 2023-06-28 19:46:27,167 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-28 19:46:51,551 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=2131986.0, ans=0.0 2023-06-28 19:46:59,322 INFO [train.py:996] (3/4) Epoch 12, batch 19900, loss[loss=0.1808, simple_loss=0.2583, pruned_loss=0.0516, over 21355.00 frames. ], tot_loss[loss=0.2041, simple_loss=0.2873, pruned_loss=0.06047, over 4278439.19 frames. ], batch size: 144, lr: 2.40e-03, grad_scale: 16.0 2023-06-28 19:47:02,145 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.57 vs. limit=22.5 2023-06-28 19:47:39,275 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=2132106.0, ans=0.125 2023-06-28 19:48:38,592 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=2132286.0, ans=0.0 2023-06-28 19:48:40,076 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=2132286.0, ans=0.09899494936611666 2023-06-28 19:48:42,868 INFO [train.py:996] (3/4) Epoch 12, batch 19950, loss[loss=0.1827, simple_loss=0.2671, pruned_loss=0.04912, over 21629.00 frames. ], tot_loss[loss=0.2013, simple_loss=0.2817, pruned_loss=0.06047, over 4280963.16 frames. ], batch size: 230, lr: 2.40e-03, grad_scale: 16.0 2023-06-28 19:48:44,571 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.913e+02 9.095e+02 1.320e+03 1.827e+03 2.856e+03, threshold=2.640e+03, percent-clipped=20.0 2023-06-28 19:48:57,561 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=2132346.0, ans=0.2 2023-06-28 19:49:01,147 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=13.48 vs. limit=22.5 2023-06-28 19:49:13,887 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=7.22 vs. limit=15.0 2023-06-28 19:49:20,024 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=2132406.0, ans=0.0 2023-06-28 19:49:38,751 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.72 vs. limit=15.0 2023-06-28 19:50:24,760 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=2132646.0, ans=0.125 2023-06-28 19:50:25,855 INFO [train.py:996] (3/4) Epoch 12, batch 20000, loss[loss=0.2279, simple_loss=0.3033, pruned_loss=0.07631, over 21523.00 frames. 
], tot_loss[loss=0.202, simple_loss=0.282, pruned_loss=0.06103, over 4275654.19 frames. ], batch size: 548, lr: 2.40e-03, grad_scale: 32.0 2023-06-28 19:50:27,070 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=8.63 vs. limit=15.0 2023-06-28 19:50:55,972 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=2132706.0, ans=0.2 2023-06-28 19:52:02,597 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=2132886.0, ans=0.125 2023-06-28 19:52:06,801 INFO [train.py:996] (3/4) Epoch 12, batch 20050, loss[loss=0.2364, simple_loss=0.3081, pruned_loss=0.08234, over 21767.00 frames. ], tot_loss[loss=0.2055, simple_loss=0.2848, pruned_loss=0.06305, over 4280943.48 frames. ], batch size: 441, lr: 2.40e-03, grad_scale: 32.0 2023-06-28 19:52:08,364 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.463e+02 7.625e+02 1.079e+03 1.735e+03 4.168e+03, threshold=2.158e+03, percent-clipped=5.0 2023-06-28 19:53:44,636 INFO [train.py:996] (3/4) Epoch 12, batch 20100, loss[loss=0.228, simple_loss=0.3215, pruned_loss=0.06727, over 21817.00 frames. ], tot_loss[loss=0.2089, simple_loss=0.2876, pruned_loss=0.06511, over 4284694.39 frames. ], batch size: 282, lr: 2.40e-03, grad_scale: 16.0 2023-06-28 19:54:03,318 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=2133246.0, ans=0.0 2023-06-28 19:54:56,626 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=2133426.0, ans=0.125 2023-06-28 19:55:01,656 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=2133426.0, ans=0.125 2023-06-28 19:55:38,339 INFO [train.py:996] (3/4) Epoch 12, batch 20150, loss[loss=0.243, simple_loss=0.3218, pruned_loss=0.08209, over 21736.00 frames. ], tot_loss[loss=0.2161, simple_loss=0.2962, pruned_loss=0.06796, over 4284978.09 frames. ], batch size: 332, lr: 2.40e-03, grad_scale: 16.0 2023-06-28 19:55:41,238 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=14.57 vs. limit=15.0 2023-06-28 19:55:41,573 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.747e+02 8.369e+02 1.261e+03 1.979e+03 4.381e+03, threshold=2.521e+03, percent-clipped=21.0 2023-06-28 19:56:32,530 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2133666.0, ans=0.1 2023-06-28 19:57:15,078 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=2133786.0, ans=0.2 2023-06-28 19:57:24,630 INFO [train.py:996] (3/4) Epoch 12, batch 20200, loss[loss=0.2293, simple_loss=0.3304, pruned_loss=0.06415, over 21805.00 frames. ], tot_loss[loss=0.2209, simple_loss=0.3024, pruned_loss=0.06972, over 4277077.38 frames. 
], batch size: 316, lr: 2.40e-03, grad_scale: 16.0 2023-06-28 19:57:48,205 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=2133846.0, ans=0.2 2023-06-28 19:58:04,881 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=2133906.0, ans=0.125 2023-06-28 19:58:09,921 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=2133966.0, ans=0.125 2023-06-28 19:58:34,056 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=2134026.0, ans=0.0 2023-06-28 19:59:11,630 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=9.92 vs. limit=15.0 2023-06-28 19:59:11,832 INFO [train.py:996] (3/4) Epoch 12, batch 20250, loss[loss=0.199, simple_loss=0.2906, pruned_loss=0.05372, over 21767.00 frames. ], tot_loss[loss=0.2206, simple_loss=0.3034, pruned_loss=0.06889, over 4279781.05 frames. ], batch size: 298, lr: 2.40e-03, grad_scale: 16.0 2023-06-28 19:59:19,724 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.497e+02 8.759e+02 1.398e+03 2.270e+03 4.094e+03, threshold=2.796e+03, percent-clipped=18.0 2023-06-28 19:59:39,124 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=2134206.0, ans=0.125 2023-06-28 20:00:16,507 INFO [scaling.py:962] (3/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=3.90 vs. limit=5.0 2023-06-28 20:00:32,880 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.17 vs. limit=15.0 2023-06-28 20:00:51,302 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=2134386.0, ans=0.95 2023-06-28 20:00:53,959 INFO [train.py:996] (3/4) Epoch 12, batch 20300, loss[loss=0.1941, simple_loss=0.2784, pruned_loss=0.0549, over 21555.00 frames. ], tot_loss[loss=0.2176, simple_loss=0.3023, pruned_loss=0.06644, over 4275965.57 frames. ], batch size: 212, lr: 2.40e-03, grad_scale: 16.0 2023-06-28 20:01:45,633 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=2134566.0, ans=0.125 2023-06-28 20:02:00,194 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-28 20:02:34,212 INFO [train.py:996] (3/4) Epoch 12, batch 20350, loss[loss=0.225, simple_loss=0.3049, pruned_loss=0.07254, over 21939.00 frames. ], tot_loss[loss=0.2166, simple_loss=0.3006, pruned_loss=0.06633, over 4267450.67 frames. ], batch size: 372, lr: 2.40e-03, grad_scale: 16.0 2023-06-28 20:02:37,271 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.278e+02 8.027e+02 1.220e+03 1.701e+03 2.990e+03, threshold=2.441e+03, percent-clipped=1.0 2023-06-28 20:03:04,571 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=2134806.0, ans=0.0 2023-06-28 20:04:21,618 INFO [train.py:996] (3/4) Epoch 12, batch 20400, loss[loss=0.218, simple_loss=0.3011, pruned_loss=0.06749, over 21715.00 frames. 
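The many "ScheduledFloat: name=..., batch_count=..., ans=..." lines report hyperparameters (skip rates, balancer probabilities, dropout values) whose current value is a function of the training batch count. A piecewise-linear schedule like the sketch below produces exactly this kind of (batch_count, value) readout; the schedule shape and break points here are assumptions for illustration, not the values used by scaling.py.

def scheduled_float(batch_count: float, points) -> float:
    """Piecewise-linear schedule over batch_count.
    points is a list of (batch_count, value) pairs in increasing order, e.g.
    [(0.0, 0.3), (20000.0, 0.1)] ramps a skip/dropout probability from 0.3
    down to 0.1 over the first 20k batches and holds it afterwards."""
    if batch_count <= points[0][0]:
        return points[0][1]
    if batch_count >= points[-1][0]:
        return points[-1][1]
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        if x0 <= batch_count <= x1:
            t = (batch_count - x0) / (x1 - x0)
            return y0 + t * (y1 - y0)
    raise ValueError("points must be sorted by batch_count")

# Hypothetical dropout schedule, queried at a batch count like those in the log.
schedule = [(0.0, 0.3), (20000.0, 0.1)]
print(scheduled_float(2133906.0, schedule))   # far past the ramp -> 0.1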
], tot_loss[loss=0.2211, simple_loss=0.3036, pruned_loss=0.06928, over 4269434.55 frames. ], batch size: 247, lr: 2.40e-03, grad_scale: 32.0 2023-06-28 20:04:24,644 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=8.98 vs. limit=15.0 2023-06-28 20:04:49,242 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=5.70 vs. limit=22.5 2023-06-28 20:05:42,835 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2135286.0, ans=0.1 2023-06-28 20:05:49,519 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.11 vs. limit=15.0 2023-06-28 20:05:58,025 INFO [train.py:996] (3/4) Epoch 12, batch 20450, loss[loss=0.214, simple_loss=0.2779, pruned_loss=0.07505, over 21954.00 frames. ], tot_loss[loss=0.2233, simple_loss=0.304, pruned_loss=0.0713, over 4267210.31 frames. ], batch size: 113, lr: 2.40e-03, grad_scale: 16.0 2023-06-28 20:06:03,014 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.519e+02 7.818e+02 1.125e+03 1.970e+03 4.809e+03, threshold=2.251e+03, percent-clipped=13.0 2023-06-28 20:06:03,651 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=2135346.0, ans=0.125 2023-06-28 20:06:13,825 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.32 vs. limit=10.0 2023-06-28 20:06:14,881 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=2135346.0, ans=0.125 2023-06-28 20:06:16,254 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=2135346.0, ans=0.125 2023-06-28 20:06:31,793 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=2135406.0, ans=0.125 2023-06-28 20:06:31,844 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=2135406.0, ans=0.0 2023-06-28 20:07:00,748 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=2135526.0, ans=0.125 2023-06-28 20:07:10,265 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=2135526.0, ans=0.125 2023-06-28 20:07:23,378 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=2135586.0, ans=0.125 2023-06-28 20:07:36,623 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2135586.0, ans=0.1 2023-06-28 20:07:39,503 INFO [train.py:996] (3/4) Epoch 12, batch 20500, loss[loss=0.183, simple_loss=0.2507, pruned_loss=0.05766, over 21732.00 frames. ], tot_loss[loss=0.2204, simple_loss=0.2992, pruned_loss=0.07085, over 4265943.83 frames. 
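The grad_scale values in the batch lines (8.0, 16.0, 32.0) come from mixed-precision training: the loss scaler grows the scale while steps succeed and shrinks it when overflows occur. A generic PyTorch sketch of that loop follows; the model, optimizer, data and loss_fn are placeholders, and only torch.cuda.amp.GradScaler itself is a real API, so treat the rest as an assumed outline rather than the recipe's actual training step.

import torch

def train_step(model, optimizer, scaler, features, targets, loss_fn):
    """One fp16 training step with dynamic loss scaling (illustrative sketch)."""
    optimizer.zero_grad(set_to_none=True)
    with torch.cuda.amp.autocast():
        loss = loss_fn(model(features), targets)
    scaler.scale(loss).backward()   # backward on the scaled loss
    scaler.step(optimizer)          # unscales grads, skips the step on overflow
    scaler.update()                 # grows/shrinks the scale (the logged grad_scale)
    return loss.detach(), scaler.get_scale()

# scaler = torch.cuda.amp.GradScaler()  # requires a CUDA device to exercise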
], batch size: 247, lr: 2.40e-03, grad_scale: 16.0 2023-06-28 20:07:59,113 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2135646.0, ans=0.1 2023-06-28 20:08:02,150 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=2135706.0, ans=0.2 2023-06-28 20:08:03,655 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=2135706.0, ans=0.125 2023-06-28 20:08:05,593 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2135706.0, ans=0.1 2023-06-28 20:08:38,430 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.86 vs. limit=10.0 2023-06-28 20:08:40,142 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.42 vs. limit=6.0 2023-06-28 20:09:27,002 INFO [train.py:996] (3/4) Epoch 12, batch 20550, loss[loss=0.2623, simple_loss=0.3485, pruned_loss=0.08808, over 21514.00 frames. ], tot_loss[loss=0.2155, simple_loss=0.2927, pruned_loss=0.06915, over 4264187.67 frames. ], batch size: 509, lr: 2.40e-03, grad_scale: 16.0 2023-06-28 20:09:32,109 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.925e+02 7.744e+02 1.015e+03 1.488e+03 3.056e+03, threshold=2.029e+03, percent-clipped=4.0 2023-06-28 20:09:44,099 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=2136006.0, ans=0.125 2023-06-28 20:09:59,711 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten.whitening_limit, batch_count=2136006.0, ans=15.0 2023-06-28 20:10:33,453 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=2136126.0, ans=0.2 2023-06-28 20:10:45,520 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=2136186.0, ans=0.125 2023-06-28 20:11:10,573 INFO [train.py:996] (3/4) Epoch 12, batch 20600, loss[loss=0.2242, simple_loss=0.2888, pruned_loss=0.0798, over 21467.00 frames. ], tot_loss[loss=0.2159, simple_loss=0.2951, pruned_loss=0.06834, over 4264392.79 frames. ], batch size: 211, lr: 2.40e-03, grad_scale: 16.0 2023-06-28 20:11:27,555 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.31 vs. limit=12.0 2023-06-28 20:12:35,529 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=2136486.0, ans=0.125 2023-06-28 20:12:45,953 INFO [train.py:996] (3/4) Epoch 12, batch 20650, loss[loss=0.1984, simple_loss=0.2632, pruned_loss=0.06678, over 21609.00 frames. ], tot_loss[loss=0.2134, simple_loss=0.2904, pruned_loss=0.06817, over 4257184.83 frames. 
], batch size: 441, lr: 2.40e-03, grad_scale: 16.0 2023-06-28 20:12:51,203 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.648e+02 9.695e+02 1.455e+03 2.228e+03 5.123e+03, threshold=2.910e+03, percent-clipped=30.0 2023-06-28 20:12:53,355 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=2136546.0, ans=0.125 2023-06-28 20:13:39,095 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=2136666.0, ans=0.0 2023-06-28 20:13:40,723 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2136666.0, ans=0.1 2023-06-28 20:14:17,052 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2136786.0, ans=0.1 2023-06-28 20:14:27,980 INFO [train.py:996] (3/4) Epoch 12, batch 20700, loss[loss=0.2813, simple_loss=0.356, pruned_loss=0.1033, over 21523.00 frames. ], tot_loss[loss=0.2073, simple_loss=0.2838, pruned_loss=0.06539, over 4253291.27 frames. ], batch size: 508, lr: 2.40e-03, grad_scale: 16.0 2023-06-28 20:16:09,301 INFO [train.py:996] (3/4) Epoch 12, batch 20750, loss[loss=0.2528, simple_loss=0.3583, pruned_loss=0.07365, over 21658.00 frames. ], tot_loss[loss=0.2085, simple_loss=0.2871, pruned_loss=0.06489, over 4254387.64 frames. ], batch size: 414, lr: 2.40e-03, grad_scale: 16.0 2023-06-28 20:16:13,356 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=2137146.0, ans=0.125 2023-06-28 20:16:14,423 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.341e+02 7.769e+02 1.310e+03 2.249e+03 6.727e+03, threshold=2.619e+03, percent-clipped=13.0 2023-06-28 20:16:46,337 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=2137206.0, ans=0.125 2023-06-28 20:17:14,331 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2137326.0, ans=0.1 2023-06-28 20:17:29,166 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.19 vs. limit=22.5 2023-06-28 20:17:44,997 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=2137386.0, ans=0.2 2023-06-28 20:17:51,072 INFO [train.py:996] (3/4) Epoch 12, batch 20800, loss[loss=0.1865, simple_loss=0.2508, pruned_loss=0.06109, over 21443.00 frames. ], tot_loss[loss=0.2097, simple_loss=0.2888, pruned_loss=0.06531, over 4265266.59 frames. ], batch size: 195, lr: 2.40e-03, grad_scale: 32.0 2023-06-28 20:18:52,757 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=2137566.0, ans=0.0 2023-06-28 20:19:04,144 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=2137626.0, ans=0.125 2023-06-28 20:19:18,786 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=2137686.0, ans=0.0 2023-06-28 20:19:33,044 INFO [train.py:996] (3/4) Epoch 12, batch 20850, loss[loss=0.1513, simple_loss=0.2257, pruned_loss=0.03851, over 21461.00 frames. 
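Each "Whitening: ... metric=X vs. limit=Y" line compares a per-module whiteness statistic of some activation against a limit; the constraint only matters once the metric exceeds the limit. The function below computes one plausible statistic of that kind, an eigenvalue-based anisotropy measure that equals 1.0 for perfectly white activations and grows toward the group size when one direction dominates. The exact formula used in scaling.py may differ, so read this as an assumed illustration of what is being measured.

import torch

def whitening_metric(x: torch.Tensor, num_groups: int = 1) -> float:
    """x: activations of shape (num_frames, num_channels).
    Splits channels into num_groups groups, forms each group's covariance C
    with eigenvalues lambda_i, and returns the worst-group value of
    d * sum(lambda_i^2) / (sum(lambda_i))^2, where d is the group size.
    This is 1.0 when C is isotropic and approaches d when C is rank-1."""
    num_frames, num_channels = x.shape
    assert num_channels % num_groups == 0
    d = num_channels // num_groups
    worst = 0.0
    for g in range(num_groups):
        xg = x[:, g * d:(g + 1) * d]
        xg = xg - xg.mean(dim=0, keepdim=True)
        cov = (xg.t() @ xg) / num_frames
        eigs = torch.linalg.eigvalsh(cov).clamp(min=0.0)
        worst = max(worst, (d * (eigs ** 2).sum() / eigs.sum() ** 2).item())
    return worst

x = torch.randn(2000, 256)      # roughly white activations
print(whitening_metric(x))      # a little above 1.0, well under limits like 15.0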
], tot_loss[loss=0.2035, simple_loss=0.2811, pruned_loss=0.0629, over 4261865.45 frames. ], batch size: 212, lr: 2.40e-03, grad_scale: 16.0 2023-06-28 20:19:39,613 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.878e+02 7.517e+02 1.058e+03 1.433e+03 3.063e+03, threshold=2.117e+03, percent-clipped=2.0 2023-06-28 20:19:41,654 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.min_positive, batch_count=2137746.0, ans=0.05 2023-06-28 20:20:46,252 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=2137926.0, ans=0.125 2023-06-28 20:21:10,312 INFO [train.py:996] (3/4) Epoch 12, batch 20900, loss[loss=0.1947, simple_loss=0.2713, pruned_loss=0.05903, over 21512.00 frames. ], tot_loss[loss=0.2056, simple_loss=0.2828, pruned_loss=0.0642, over 4269580.71 frames. ], batch size: 212, lr: 2.40e-03, grad_scale: 16.0 2023-06-28 20:21:26,849 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=2138106.0, ans=0.2 2023-06-28 20:21:55,575 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.23 vs. limit=22.5 2023-06-28 20:22:29,333 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.29 vs. limit=15.0 2023-06-28 20:22:48,747 INFO [train.py:996] (3/4) Epoch 12, batch 20950, loss[loss=0.1785, simple_loss=0.2517, pruned_loss=0.05269, over 21364.00 frames. ], tot_loss[loss=0.2011, simple_loss=0.2799, pruned_loss=0.06119, over 4264270.67 frames. ], batch size: 194, lr: 2.40e-03, grad_scale: 16.0 2023-06-28 20:22:55,236 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.730e+02 8.164e+02 1.366e+03 2.074e+03 5.785e+03, threshold=2.733e+03, percent-clipped=24.0 2023-06-28 20:22:59,346 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=2138346.0, ans=0.0 2023-06-28 20:23:33,436 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-28 20:24:00,848 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=2138526.0, ans=0.0 2023-06-28 20:24:24,228 INFO [train.py:996] (3/4) Epoch 12, batch 21000, loss[loss=0.2212, simple_loss=0.3069, pruned_loss=0.0678, over 21882.00 frames. ], tot_loss[loss=0.2019, simple_loss=0.2798, pruned_loss=0.06196, over 4267037.76 frames. ], batch size: 107, lr: 2.40e-03, grad_scale: 16.0 2023-06-28 20:24:24,229 INFO [train.py:1019] (3/4) Computing validation loss 2023-06-28 20:24:40,724 INFO [train.py:1028] (3/4) Epoch 12, validation: loss=0.2646, simple_loss=0.357, pruned_loss=0.08608, over 1796401.00 frames. 
2023-06-28 20:24:40,725 INFO [train.py:1029] (3/4) Maximum memory allocated so far is 23690MB 2023-06-28 20:24:59,420 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=2138646.0, ans=0.125 2023-06-28 20:25:20,241 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=2138706.0, ans=0.0 2023-06-28 20:25:47,038 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.42 vs. limit=15.0 2023-06-28 20:25:53,227 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.min_positive, batch_count=2138826.0, ans=0.05 2023-06-28 20:26:01,047 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=2138886.0, ans=0.0 2023-06-28 20:26:09,534 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.78 vs. limit=15.0 2023-06-28 20:26:13,913 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2138886.0, ans=0.1 2023-06-28 20:26:21,495 INFO [train.py:996] (3/4) Epoch 12, batch 21050, loss[loss=0.1837, simple_loss=0.2596, pruned_loss=0.05395, over 21814.00 frames. ], tot_loss[loss=0.2009, simple_loss=0.2781, pruned_loss=0.06189, over 4270537.77 frames. ], batch size: 118, lr: 2.40e-03, grad_scale: 16.0 2023-06-28 20:26:28,201 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.708e+02 6.795e+02 9.340e+02 1.308e+03 3.165e+03, threshold=1.868e+03, percent-clipped=2.0 2023-06-28 20:26:30,479 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=2138946.0, ans=0.125 2023-06-28 20:26:57,364 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2139006.0, ans=0.1 2023-06-28 20:27:31,443 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-28 20:27:37,696 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=2139126.0, ans=0.0 2023-06-28 20:27:43,001 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=2139186.0, ans=0.125 2023-06-28 20:28:01,137 INFO [train.py:996] (3/4) Epoch 12, batch 21100, loss[loss=0.2055, simple_loss=0.2708, pruned_loss=0.0701, over 21380.00 frames. ], tot_loss[loss=0.199, simple_loss=0.2745, pruned_loss=0.06174, over 4260154.31 frames. ], batch size: 160, lr: 2.40e-03, grad_scale: 8.0 2023-06-28 20:28:03,175 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=2139246.0, ans=0.125 2023-06-28 20:28:55,227 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=2139366.0, ans=0.0 2023-06-28 20:28:58,787 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=2139426.0, ans=0.0 2023-06-28 20:29:42,287 INFO [train.py:996] (3/4) Epoch 12, batch 21150, loss[loss=0.2123, simple_loss=0.2599, pruned_loss=0.08236, over 21528.00 frames. 
], tot_loss[loss=0.1967, simple_loss=0.2695, pruned_loss=0.06194, over 4260106.44 frames. ], batch size: 512, lr: 2.40e-03, grad_scale: 8.0 2023-06-28 20:29:46,033 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=2139546.0, ans=0.0 2023-06-28 20:29:50,623 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.872e+02 8.259e+02 1.205e+03 1.749e+03 3.220e+03, threshold=2.410e+03, percent-clipped=20.0 2023-06-28 20:29:59,039 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=2139546.0, ans=0.125 2023-06-28 20:30:09,307 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2139606.0, ans=0.1 2023-06-28 20:30:10,709 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=2139606.0, ans=0.0 2023-06-28 20:30:26,997 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=2139666.0, ans=0.125 2023-06-28 20:30:33,137 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=2139666.0, ans=0.0 2023-06-28 20:30:56,305 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2139726.0, ans=0.1 2023-06-28 20:31:08,534 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.15 vs. limit=10.0 2023-06-28 20:31:23,258 INFO [train.py:996] (3/4) Epoch 12, batch 21200, loss[loss=0.1905, simple_loss=0.2602, pruned_loss=0.06038, over 21576.00 frames. ], tot_loss[loss=0.1942, simple_loss=0.2666, pruned_loss=0.06094, over 4255956.23 frames. ], batch size: 414, lr: 2.40e-03, grad_scale: 16.0 2023-06-28 20:31:41,878 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.88 vs. limit=15.0 2023-06-28 20:32:03,397 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=12.28 vs. limit=15.0 2023-06-28 20:33:02,597 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.07 vs. limit=6.0 2023-06-28 20:33:04,753 INFO [train.py:996] (3/4) Epoch 12, batch 21250, loss[loss=0.2105, simple_loss=0.2873, pruned_loss=0.06688, over 21400.00 frames. ], tot_loss[loss=0.1931, simple_loss=0.2645, pruned_loss=0.06082, over 4252809.29 frames. 
], batch size: 194, lr: 2.40e-03, grad_scale: 16.0 2023-06-28 20:33:07,260 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=2140146.0, ans=0.125 2023-06-28 20:33:13,142 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.761e+02 7.355e+02 9.747e+02 1.370e+03 2.666e+03, threshold=1.949e+03, percent-clipped=4.0 2023-06-28 20:33:15,452 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=2140146.0, ans=0.0 2023-06-28 20:33:25,541 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=2140206.0, ans=0.125 2023-06-28 20:34:22,393 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.17 vs. limit=15.0 2023-06-28 20:34:23,476 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=2140326.0, ans=0.0 2023-06-28 20:34:24,119 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=5.86 vs. limit=15.0 2023-06-28 20:34:47,105 INFO [train.py:996] (3/4) Epoch 12, batch 21300, loss[loss=0.2124, simple_loss=0.292, pruned_loss=0.06638, over 21482.00 frames. ], tot_loss[loss=0.198, simple_loss=0.2708, pruned_loss=0.06263, over 4259679.68 frames. ], batch size: 212, lr: 2.40e-03, grad_scale: 16.0 2023-06-28 20:35:12,442 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=2140506.0, ans=0.0 2023-06-28 20:35:37,141 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=2140566.0, ans=0.125 2023-06-28 20:35:48,796 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2140566.0, ans=0.1 2023-06-28 20:35:56,434 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.06 vs. limit=10.0 2023-06-28 20:36:22,397 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=2140686.0, ans=0.0 2023-06-28 20:36:29,985 INFO [train.py:996] (3/4) Epoch 12, batch 21350, loss[loss=0.1821, simple_loss=0.2702, pruned_loss=0.04702, over 21412.00 frames. ], tot_loss[loss=0.2014, simple_loss=0.2755, pruned_loss=0.06371, over 4258364.70 frames. ], batch size: 211, lr: 2.40e-03, grad_scale: 16.0 2023-06-28 20:36:43,163 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.041e+02 8.389e+02 1.153e+03 1.810e+03 4.461e+03, threshold=2.306e+03, percent-clipped=20.0 2023-06-28 20:37:10,119 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=2140806.0, ans=0.125 2023-06-28 20:37:21,640 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=2140866.0, ans=0.0 2023-06-28 20:37:37,419 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.56 vs. limit=15.0 2023-06-28 20:38:16,930 INFO [train.py:996] (3/4) Epoch 12, batch 21400, loss[loss=0.2576, simple_loss=0.3403, pruned_loss=0.08749, over 21813.00 frames. 
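Across the batch lines, the reported loss is consistently 0.5 * simple_loss + pruned_loss (for instance 0.5 * 0.3403 + 0.08749 ≈ 0.2576 in the line just above), i.e. a down-weighted simple transducer loss plus the full pruned loss. The check below only verifies that relationship on numbers copied from the log; the 0.5 weight is inferred from these lines, not taken from the recipe's code.

def combined_loss(simple_loss: float, pruned_loss: float,
                  simple_loss_scale: float = 0.5) -> float:
    """Weighted combination that matches the loss values printed in the log.
    The 0.5 default is read off the logged numbers, an assumption here."""
    return simple_loss_scale * simple_loss + pruned_loss

# A few (loss, simple_loss, pruned_loss) triples copied from the lines above.
for loss, simple, pruned in [(0.2576, 0.3403, 0.08749),
                             (0.2032, 0.2870, 0.05966),
                             (0.2646, 0.3570, 0.08608)]:
    print(loss, round(combined_loss(simple, pruned), 4))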
], tot_loss[loss=0.2023, simple_loss=0.279, pruned_loss=0.06284, over 4262860.21 frames. ], batch size: 118, lr: 2.40e-03, grad_scale: 16.0 2023-06-28 20:38:40,352 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=2141106.0, ans=0.125 2023-06-28 20:38:59,506 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-28 20:39:11,901 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.38 vs. limit=10.0 2023-06-28 20:39:18,277 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=7.40 vs. limit=15.0 2023-06-28 20:39:49,995 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=2141286.0, ans=0.07 2023-06-28 20:39:57,095 INFO [train.py:996] (3/4) Epoch 12, batch 21450, loss[loss=0.198, simple_loss=0.272, pruned_loss=0.062, over 21676.00 frames. ], tot_loss[loss=0.2052, simple_loss=0.2819, pruned_loss=0.06422, over 4272128.61 frames. ], batch size: 230, lr: 2.40e-03, grad_scale: 16.0 2023-06-28 20:40:04,992 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.076e+02 7.437e+02 1.005e+03 1.722e+03 2.921e+03, threshold=2.009e+03, percent-clipped=6.0 2023-06-28 20:40:58,977 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=2141526.0, ans=0.0 2023-06-28 20:41:12,281 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.35 vs. limit=22.5 2023-06-28 20:41:18,855 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=2141586.0, ans=0.2 2023-06-28 20:41:38,437 INFO [train.py:996] (3/4) Epoch 12, batch 21500, loss[loss=0.1913, simple_loss=0.2545, pruned_loss=0.06403, over 21611.00 frames. ], tot_loss[loss=0.2058, simple_loss=0.2803, pruned_loss=0.06565, over 4269587.10 frames. ], batch size: 231, lr: 2.40e-03, grad_scale: 16.0 2023-06-28 20:42:00,814 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=2141706.0, ans=0.125 2023-06-28 20:42:20,446 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.76 vs. limit=15.0 2023-06-28 20:42:33,206 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=2141766.0, ans=0.125 2023-06-28 20:43:13,466 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=2141886.0, ans=0.125 2023-06-28 20:43:13,541 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=2141886.0, ans=0.09899494936611666 2023-06-28 20:43:19,796 INFO [train.py:996] (3/4) Epoch 12, batch 21550, loss[loss=0.1784, simple_loss=0.2479, pruned_loss=0.05441, over 21484.00 frames. ], tot_loss[loss=0.1995, simple_loss=0.2729, pruned_loss=0.06309, over 4276872.30 frames. 
], batch size: 441, lr: 2.40e-03, grad_scale: 16.0 2023-06-28 20:43:32,812 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.830e+02 7.462e+02 9.978e+02 1.500e+03 2.892e+03, threshold=1.996e+03, percent-clipped=12.0 2023-06-28 20:43:54,901 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=2142006.0, ans=0.0 2023-06-28 20:45:03,645 INFO [train.py:996] (3/4) Epoch 12, batch 21600, loss[loss=0.1837, simple_loss=0.2592, pruned_loss=0.05411, over 21582.00 frames. ], tot_loss[loss=0.1957, simple_loss=0.2685, pruned_loss=0.06144, over 4272464.24 frames. ], batch size: 263, lr: 2.40e-03, grad_scale: 32.0 2023-06-28 20:45:21,868 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.45 vs. limit=10.0 2023-06-28 20:45:30,210 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2142306.0, ans=0.1 2023-06-28 20:45:51,876 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=2142366.0, ans=0.125 2023-06-28 20:46:20,146 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.27 vs. limit=15.0 2023-06-28 20:46:26,178 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=2142486.0, ans=0.125 2023-06-28 20:46:51,880 INFO [train.py:996] (3/4) Epoch 12, batch 21650, loss[loss=0.1861, simple_loss=0.2583, pruned_loss=0.05699, over 21815.00 frames. ], tot_loss[loss=0.1966, simple_loss=0.2739, pruned_loss=0.05963, over 4270113.26 frames. ], batch size: 98, lr: 2.40e-03, grad_scale: 8.0 2023-06-28 20:47:03,116 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.132e+02 8.434e+02 1.336e+03 2.286e+03 3.969e+03, threshold=2.673e+03, percent-clipped=30.0 2023-06-28 20:47:03,715 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=2142546.0, ans=0.07 2023-06-28 20:47:18,185 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-28 20:47:25,270 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=2142606.0, ans=0.125 2023-06-28 20:47:35,551 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=7.78 vs. limit=15.0 2023-06-28 20:47:35,712 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.63 vs. limit=15.0 2023-06-28 20:48:26,748 INFO [train.py:996] (3/4) Epoch 12, batch 21700, loss[loss=0.1837, simple_loss=0.2767, pruned_loss=0.04534, over 21669.00 frames. ], tot_loss[loss=0.1958, simple_loss=0.2746, pruned_loss=0.05848, over 4265319.14 frames. ], batch size: 230, lr: 2.40e-03, grad_scale: 8.0 2023-06-28 20:48:47,871 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.30 vs. 
limit=15.0 2023-06-28 20:49:05,544 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=2142906.0, ans=0.125 2023-06-28 20:49:17,176 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2142966.0, ans=0.1 2023-06-28 20:49:50,838 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=2143086.0, ans=0.125 2023-06-28 20:49:58,479 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=2143086.0, ans=0.125 2023-06-28 20:50:07,592 INFO [train.py:996] (3/4) Epoch 12, batch 21750, loss[loss=0.192, simple_loss=0.266, pruned_loss=0.05905, over 21306.00 frames. ], tot_loss[loss=0.195, simple_loss=0.2711, pruned_loss=0.05942, over 4273762.12 frames. ], batch size: 144, lr: 2.40e-03, grad_scale: 8.0 2023-06-28 20:50:08,031 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=2143146.0, ans=0.0 2023-06-28 20:50:24,253 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.270e+02 7.010e+02 1.001e+03 1.482e+03 3.293e+03, threshold=2.002e+03, percent-clipped=2.0 2023-06-28 20:50:31,636 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-28 20:51:42,956 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.73 vs. limit=15.0 2023-06-28 20:51:54,917 INFO [train.py:996] (3/4) Epoch 12, batch 21800, loss[loss=0.2338, simple_loss=0.3114, pruned_loss=0.07812, over 21668.00 frames. ], tot_loss[loss=0.1946, simple_loss=0.2693, pruned_loss=0.05996, over 4272451.79 frames. ], batch size: 415, lr: 2.40e-03, grad_scale: 8.0 2023-06-28 20:52:24,105 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=2143506.0, ans=0.125 2023-06-28 20:52:30,836 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=2143506.0, ans=0.0 2023-06-28 20:52:37,530 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=2143566.0, ans=0.125 2023-06-28 20:52:52,271 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=2143566.0, ans=0.125 2023-06-28 20:52:58,673 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=2143626.0, ans=0.125 2023-06-28 20:53:37,018 INFO [train.py:996] (3/4) Epoch 12, batch 21850, loss[loss=0.1977, simple_loss=0.2759, pruned_loss=0.05972, over 21551.00 frames. ], tot_loss[loss=0.1996, simple_loss=0.2779, pruned_loss=0.06062, over 4275802.69 frames. 
], batch size: 212, lr: 2.40e-03, grad_scale: 8.0 2023-06-28 20:53:48,649 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.108e+02 8.276e+02 1.227e+03 1.863e+03 4.037e+03, threshold=2.455e+03, percent-clipped=20.0 2023-06-28 20:54:50,714 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=2143926.0, ans=0.0 2023-06-28 20:55:11,108 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.07 vs. limit=15.0 2023-06-28 20:55:18,320 INFO [train.py:996] (3/4) Epoch 12, batch 21900, loss[loss=0.1948, simple_loss=0.2621, pruned_loss=0.06378, over 21750.00 frames. ], tot_loss[loss=0.2002, simple_loss=0.2771, pruned_loss=0.06165, over 4272331.92 frames. ], batch size: 371, lr: 2.40e-03, grad_scale: 8.0 2023-06-28 20:56:13,494 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2144166.0, ans=0.1 2023-06-28 20:56:41,267 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=2144286.0, ans=0.125 2023-06-28 20:56:58,121 INFO [train.py:996] (3/4) Epoch 12, batch 21950, loss[loss=0.1471, simple_loss=0.2282, pruned_loss=0.033, over 21521.00 frames. ], tot_loss[loss=0.1958, simple_loss=0.2707, pruned_loss=0.06045, over 4266395.18 frames. ], batch size: 212, lr: 2.40e-03, grad_scale: 8.0 2023-06-28 20:57:09,570 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.558e+02 7.761e+02 1.147e+03 1.869e+03 4.092e+03, threshold=2.294e+03, percent-clipped=9.0 2023-06-28 20:57:32,862 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=2144406.0, ans=0.125 2023-06-28 20:57:44,154 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2144466.0, ans=0.1 2023-06-28 20:57:52,577 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=2144466.0, ans=0.125 2023-06-28 20:58:40,386 INFO [train.py:996] (3/4) Epoch 12, batch 22000, loss[loss=0.1642, simple_loss=0.2294, pruned_loss=0.04953, over 21265.00 frames. ], tot_loss[loss=0.1894, simple_loss=0.2646, pruned_loss=0.05712, over 4254791.53 frames. ], batch size: 551, lr: 2.40e-03, grad_scale: 16.0 2023-06-28 20:58:46,307 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2144646.0, ans=0.1 2023-06-28 20:59:22,106 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=2144766.0, ans=0.125 2023-06-28 20:59:55,455 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=2144826.0, ans=0.125 2023-06-28 21:00:03,370 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2144886.0, ans=0.1 2023-06-28 21:00:23,762 INFO [train.py:996] (3/4) Epoch 12, batch 22050, loss[loss=0.2388, simple_loss=0.3223, pruned_loss=0.07769, over 21787.00 frames. ], tot_loss[loss=0.1956, simple_loss=0.2727, pruned_loss=0.05927, over 4260267.52 frames. 
], batch size: 282, lr: 2.40e-03, grad_scale: 16.0 2023-06-28 21:00:24,837 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.93 vs. limit=6.0 2023-06-28 21:00:40,615 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.182e+02 7.125e+02 1.182e+03 1.630e+03 4.961e+03, threshold=2.364e+03, percent-clipped=13.0 2023-06-28 21:00:44,682 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=2145006.0, ans=0.125 2023-06-28 21:00:49,371 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=2145006.0, ans=0.2 2023-06-28 21:00:52,636 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=2145006.0, ans=0.125 2023-06-28 21:01:38,737 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=2145126.0, ans=0.125 2023-06-28 21:01:53,487 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=2145186.0, ans=0.125 2023-06-28 21:01:56,547 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=2145186.0, ans=0.125 2023-06-28 21:02:06,002 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=8.94 vs. limit=10.0 2023-06-28 21:02:06,217 INFO [train.py:996] (3/4) Epoch 12, batch 22100, loss[loss=0.2301, simple_loss=0.3113, pruned_loss=0.07446, over 21871.00 frames. ], tot_loss[loss=0.204, simple_loss=0.2812, pruned_loss=0.06345, over 4256845.24 frames. ], batch size: 124, lr: 2.40e-03, grad_scale: 16.0 2023-06-28 21:02:18,273 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=2145246.0, ans=0.0 2023-06-28 21:02:19,985 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=2145246.0, ans=0.0 2023-06-28 21:02:48,194 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=2145366.0, ans=0.2 2023-06-28 21:03:47,947 INFO [train.py:996] (3/4) Epoch 12, batch 22150, loss[loss=0.1991, simple_loss=0.2799, pruned_loss=0.05915, over 21434.00 frames. ], tot_loss[loss=0.2069, simple_loss=0.2832, pruned_loss=0.06533, over 4263947.80 frames. ], batch size: 177, lr: 2.40e-03, grad_scale: 16.0 2023-06-28 21:04:04,053 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.507e+02 8.832e+02 1.298e+03 1.809e+03 3.590e+03, threshold=2.596e+03, percent-clipped=11.0 2023-06-28 21:04:19,903 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=5.05 vs. limit=12.0 2023-06-28 21:05:04,659 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=8.73 vs. 
limit=10.0 2023-06-28 21:05:11,818 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=2145786.0, ans=0.125 2023-06-28 21:05:16,701 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=2145786.0, ans=0.125 2023-06-28 21:05:29,495 INFO [train.py:996] (3/4) Epoch 12, batch 22200, loss[loss=0.2119, simple_loss=0.3049, pruned_loss=0.05945, over 21797.00 frames. ], tot_loss[loss=0.209, simple_loss=0.2857, pruned_loss=0.0661, over 4265670.07 frames. ], batch size: 282, lr: 2.40e-03, grad_scale: 16.0 2023-06-28 21:06:28,552 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=2145966.0, ans=0.125 2023-06-28 21:07:17,284 INFO [train.py:996] (3/4) Epoch 12, batch 22250, loss[loss=0.2197, simple_loss=0.3135, pruned_loss=0.06297, over 17134.00 frames. ], tot_loss[loss=0.2148, simple_loss=0.2938, pruned_loss=0.06789, over 4271065.04 frames. ], batch size: 60, lr: 2.39e-03, grad_scale: 16.0 2023-06-28 21:07:29,282 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.277e+02 8.206e+02 1.186e+03 1.604e+03 3.301e+03, threshold=2.372e+03, percent-clipped=3.0 2023-06-28 21:07:44,010 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=2146206.0, ans=0.125 2023-06-28 21:08:13,411 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.67 vs. limit=6.0 2023-06-28 21:08:37,110 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=2146386.0, ans=0.2 2023-06-28 21:08:57,730 INFO [train.py:996] (3/4) Epoch 12, batch 22300, loss[loss=0.2082, simple_loss=0.2843, pruned_loss=0.0661, over 21931.00 frames. ], tot_loss[loss=0.217, simple_loss=0.295, pruned_loss=0.06947, over 4268214.21 frames. ], batch size: 351, lr: 2.39e-03, grad_scale: 16.0 2023-06-28 21:09:00,241 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.38 vs. limit=10.0 2023-06-28 21:09:33,860 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2146566.0, ans=0.1 2023-06-28 21:10:36,520 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=10.90 vs. limit=15.0 2023-06-28 21:10:38,565 INFO [train.py:996] (3/4) Epoch 12, batch 22350, loss[loss=0.1976, simple_loss=0.2686, pruned_loss=0.06325, over 21589.00 frames. ], tot_loss[loss=0.216, simple_loss=0.2924, pruned_loss=0.06981, over 4279163.33 frames. 
], batch size: 212, lr: 2.39e-03, grad_scale: 16.0 2023-06-28 21:10:42,508 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=2146746.0, ans=0.0 2023-06-28 21:10:49,081 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=2146746.0, ans=0.035 2023-06-28 21:10:50,160 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.715e+02 7.662e+02 1.007e+03 1.656e+03 3.932e+03, threshold=2.013e+03, percent-clipped=14.0 2023-06-28 21:10:59,058 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=2146806.0, ans=0.125 2023-06-28 21:11:06,117 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=2146806.0, ans=0.0 2023-06-28 21:11:23,669 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=2146866.0, ans=0.0 2023-06-28 21:11:35,221 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2146866.0, ans=0.1 2023-06-28 21:12:14,305 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=2146986.0, ans=0.125 2023-06-28 21:12:14,307 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=2146986.0, ans=0.2 2023-06-28 21:12:20,271 INFO [train.py:996] (3/4) Epoch 12, batch 22400, loss[loss=0.2011, simple_loss=0.2776, pruned_loss=0.06233, over 21642.00 frames. ], tot_loss[loss=0.2111, simple_loss=0.2895, pruned_loss=0.06631, over 4282934.70 frames. ], batch size: 298, lr: 2.39e-03, grad_scale: 16.0 2023-06-28 21:12:49,929 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=2147106.0, ans=0.0 2023-06-28 21:13:22,132 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=2147226.0, ans=0.125 2023-06-28 21:13:30,244 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=2147226.0, ans=0.0 2023-06-28 21:13:33,780 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2147226.0, ans=0.1 2023-06-28 21:13:36,645 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=2147226.0, ans=0.0 2023-06-28 21:14:05,221 INFO [train.py:996] (3/4) Epoch 12, batch 22450, loss[loss=0.1825, simple_loss=0.255, pruned_loss=0.05494, over 21671.00 frames. ], tot_loss[loss=0.2069, simple_loss=0.283, pruned_loss=0.06543, over 4285615.74 frames. 
], batch size: 333, lr: 2.39e-03, grad_scale: 16.0 2023-06-28 21:14:18,835 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.767e+02 6.974e+02 9.708e+02 1.486e+03 4.519e+03, threshold=1.942e+03, percent-clipped=14.0 2023-06-28 21:14:50,568 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-28 21:15:15,008 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten.whitening_limit, batch_count=2147526.0, ans=15.0 2023-06-28 21:15:48,355 INFO [train.py:996] (3/4) Epoch 12, batch 22500, loss[loss=0.2069, simple_loss=0.3039, pruned_loss=0.05492, over 21396.00 frames. ], tot_loss[loss=0.2059, simple_loss=0.2802, pruned_loss=0.06574, over 4275249.31 frames. ], batch size: 211, lr: 2.39e-03, grad_scale: 16.0 2023-06-28 21:17:31,316 INFO [train.py:996] (3/4) Epoch 12, batch 22550, loss[loss=0.2113, simple_loss=0.2851, pruned_loss=0.06872, over 21278.00 frames. ], tot_loss[loss=0.2073, simple_loss=0.2829, pruned_loss=0.06589, over 4280058.47 frames. ], batch size: 176, lr: 2.39e-03, grad_scale: 16.0 2023-06-28 21:17:49,757 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.593e+02 9.385e+02 1.394e+03 1.973e+03 3.224e+03, threshold=2.788e+03, percent-clipped=25.0 2023-06-28 21:18:02,951 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=2148006.0, ans=0.125 2023-06-28 21:18:04,671 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=2148006.0, ans=0.125 2023-06-28 21:19:20,500 INFO [train.py:996] (3/4) Epoch 12, batch 22600, loss[loss=0.2067, simple_loss=0.2934, pruned_loss=0.05995, over 21688.00 frames. ], tot_loss[loss=0.2088, simple_loss=0.2857, pruned_loss=0.06596, over 4285948.46 frames. ], batch size: 298, lr: 2.39e-03, grad_scale: 16.0 2023-06-28 21:19:24,601 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=2148246.0, ans=0.0 2023-06-28 21:19:33,613 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.40 vs. limit=22.5 2023-06-28 21:19:34,933 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2148246.0, ans=0.1 2023-06-28 21:19:35,374 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.77 vs. limit=15.0 2023-06-28 21:19:36,804 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=2148306.0, ans=0.2 2023-06-28 21:20:12,142 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.min_positive, batch_count=2148366.0, ans=0.05 2023-06-28 21:20:43,995 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=15.00 vs. limit=22.5 2023-06-28 21:21:01,905 INFO [train.py:996] (3/4) Epoch 12, batch 22650, loss[loss=0.2268, simple_loss=0.3427, pruned_loss=0.05545, over 19752.00 frames. ], tot_loss[loss=0.2063, simple_loss=0.2817, pruned_loss=0.06542, over 4268255.63 frames. 
], batch size: 703, lr: 2.39e-03, grad_scale: 16.0 2023-06-28 21:21:10,624 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=2148546.0, ans=0.2 2023-06-28 21:21:10,731 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=2148546.0, ans=0.2 2023-06-28 21:21:14,879 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.226e+02 9.650e+02 1.395e+03 1.973e+03 4.081e+03, threshold=2.791e+03, percent-clipped=9.0 2023-06-28 21:21:39,761 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=2148666.0, ans=0.2 2023-06-28 21:21:54,091 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=2148666.0, ans=0.1 2023-06-28 21:22:05,450 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=2148726.0, ans=0.0 2023-06-28 21:22:41,739 INFO [train.py:996] (3/4) Epoch 12, batch 22700, loss[loss=0.2049, simple_loss=0.2765, pruned_loss=0.06665, over 21870.00 frames. ], tot_loss[loss=0.2024, simple_loss=0.2754, pruned_loss=0.06469, over 4266734.12 frames. ], batch size: 107, lr: 2.39e-03, grad_scale: 16.0 2023-06-28 21:22:45,930 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=2148846.0, ans=0.0 2023-06-28 21:22:54,887 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.81 vs. limit=10.0 2023-06-28 21:23:48,819 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=2149026.0, ans=0.125 2023-06-28 21:24:15,460 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.25 vs. limit=15.0 2023-06-28 21:24:24,380 INFO [train.py:996] (3/4) Epoch 12, batch 22750, loss[loss=0.2126, simple_loss=0.2899, pruned_loss=0.06765, over 21744.00 frames. ], tot_loss[loss=0.2044, simple_loss=0.276, pruned_loss=0.06638, over 4251647.46 frames. ], batch size: 113, lr: 2.39e-03, grad_scale: 16.0 2023-06-28 21:24:37,891 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.797e+02 7.718e+02 1.201e+03 1.681e+03 3.626e+03, threshold=2.402e+03, percent-clipped=4.0 2023-06-28 21:25:12,327 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=2149266.0, ans=0.125 2023-06-28 21:25:28,932 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=2149326.0, ans=0.125 2023-06-28 21:25:30,772 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=2149326.0, ans=0.125 2023-06-28 21:25:32,229 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=2149326.0, ans=0.0 2023-06-28 21:25:47,040 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=2149326.0, ans=0.125 2023-06-28 21:26:03,799 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.22 vs. 
limit=22.5 2023-06-28 21:26:05,771 INFO [train.py:996] (3/4) Epoch 12, batch 22800, loss[loss=0.2243, simple_loss=0.2916, pruned_loss=0.07855, over 21241.00 frames. ], tot_loss[loss=0.2083, simple_loss=0.2804, pruned_loss=0.06816, over 4259374.23 frames. ], batch size: 176, lr: 2.39e-03, grad_scale: 32.0 2023-06-28 21:26:06,403 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2149446.0, ans=0.1 2023-06-28 21:26:51,516 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=2149566.0, ans=0.5 2023-06-28 21:27:45,034 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=2149746.0, ans=0.125 2023-06-28 21:27:45,950 INFO [train.py:996] (3/4) Epoch 12, batch 22850, loss[loss=0.2473, simple_loss=0.2882, pruned_loss=0.1032, over 21391.00 frames. ], tot_loss[loss=0.206, simple_loss=0.2768, pruned_loss=0.0676, over 4269043.07 frames. ], batch size: 508, lr: 2.39e-03, grad_scale: 16.0 2023-06-28 21:27:58,606 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=2149746.0, ans=0.2 2023-06-28 21:28:01,317 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.804e+02 7.642e+02 1.050e+03 1.882e+03 3.484e+03, threshold=2.099e+03, percent-clipped=13.0 2023-06-28 21:28:18,657 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=2149806.0, ans=0.0 2023-06-28 21:28:35,966 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.13 vs. limit=10.0 2023-06-28 21:28:53,621 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=2149926.0, ans=0.125 2023-06-28 21:29:30,152 INFO [train.py:996] (3/4) Epoch 12, batch 22900, loss[loss=0.2823, simple_loss=0.3689, pruned_loss=0.09783, over 21458.00 frames. ], tot_loss[loss=0.2067, simple_loss=0.2799, pruned_loss=0.0668, over 4259823.48 frames. ], batch size: 507, lr: 2.39e-03, grad_scale: 16.0 2023-06-28 21:29:41,137 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=2150046.0, ans=0.125 2023-06-28 21:30:41,252 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=2150226.0, ans=0.125 2023-06-28 21:31:03,753 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2150286.0, ans=0.1 2023-06-28 21:31:19,856 INFO [train.py:996] (3/4) Epoch 12, batch 22950, loss[loss=0.2051, simple_loss=0.3165, pruned_loss=0.04684, over 21616.00 frames. ], tot_loss[loss=0.2125, simple_loss=0.2934, pruned_loss=0.0658, over 4256883.42 frames. 
], batch size: 230, lr: 2.39e-03, grad_scale: 16.0 2023-06-28 21:31:39,647 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.589e+02 9.756e+02 1.509e+03 2.315e+03 4.900e+03, threshold=3.017e+03, percent-clipped=30.0 2023-06-28 21:31:43,682 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=2150406.0, ans=0.05 2023-06-28 21:32:02,081 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=2150466.0, ans=0.0 2023-06-28 21:32:35,349 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=11.46 vs. limit=15.0 2023-06-28 21:33:02,909 INFO [train.py:996] (3/4) Epoch 12, batch 23000, loss[loss=0.1883, simple_loss=0.2683, pruned_loss=0.05414, over 21639.00 frames. ], tot_loss[loss=0.2102, simple_loss=0.2928, pruned_loss=0.06376, over 4254452.27 frames. ], batch size: 230, lr: 2.39e-03, grad_scale: 16.0 2023-06-28 21:33:15,871 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=8.00 vs. limit=15.0 2023-06-28 21:33:18,498 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=2150646.0, ans=0.0 2023-06-28 21:34:16,633 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=2150826.0, ans=0.0 2023-06-28 21:34:31,697 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=2150886.0, ans=0.125 2023-06-28 21:34:51,526 INFO [train.py:996] (3/4) Epoch 12, batch 23050, loss[loss=0.2676, simple_loss=0.3336, pruned_loss=0.1008, over 21501.00 frames. ], tot_loss[loss=0.2134, simple_loss=0.2941, pruned_loss=0.06631, over 4259265.93 frames. ], batch size: 471, lr: 2.39e-03, grad_scale: 16.0 2023-06-28 21:35:02,075 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-28 21:35:05,353 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=2150946.0, ans=0.125 2023-06-28 21:35:10,951 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.719e+02 9.558e+02 1.419e+03 1.890e+03 3.669e+03, threshold=2.838e+03, percent-clipped=6.0 2023-06-28 21:35:11,869 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=2151006.0, ans=0.2 2023-06-28 21:35:13,237 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=2151006.0, ans=0.0 2023-06-28 21:35:30,248 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.79 vs. limit=15.0 2023-06-28 21:35:38,132 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=2151066.0, ans=0.05 2023-06-28 21:35:45,917 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.33 vs. 
limit=15.0 2023-06-28 21:36:31,612 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.min_positive, batch_count=2151186.0, ans=0.025 2023-06-28 21:36:34,599 INFO [train.py:996] (3/4) Epoch 12, batch 23100, loss[loss=0.1898, simple_loss=0.2541, pruned_loss=0.06279, over 21161.00 frames. ], tot_loss[loss=0.2115, simple_loss=0.2897, pruned_loss=0.0666, over 4268561.24 frames. ], batch size: 176, lr: 2.39e-03, grad_scale: 16.0 2023-06-28 21:36:55,412 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.13 vs. limit=15.0 2023-06-28 21:37:01,938 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2151306.0, ans=0.1 2023-06-28 21:38:16,275 INFO [train.py:996] (3/4) Epoch 12, batch 23150, loss[loss=0.1854, simple_loss=0.2562, pruned_loss=0.0573, over 21572.00 frames. ], tot_loss[loss=0.2077, simple_loss=0.2836, pruned_loss=0.06592, over 4264380.54 frames. ], batch size: 230, lr: 2.39e-03, grad_scale: 16.0 2023-06-28 21:38:16,641 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=2151546.0, ans=0.125 2023-06-28 21:38:26,486 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=2151546.0, ans=0.125 2023-06-28 21:38:30,888 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.983e+02 7.198e+02 1.006e+03 1.345e+03 2.860e+03, threshold=2.012e+03, percent-clipped=2.0 2023-06-28 21:38:47,499 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=2151606.0, ans=0.0 2023-06-28 21:39:57,502 INFO [train.py:996] (3/4) Epoch 12, batch 23200, loss[loss=0.2126, simple_loss=0.2747, pruned_loss=0.0753, over 21593.00 frames. ], tot_loss[loss=0.208, simple_loss=0.283, pruned_loss=0.06647, over 4266188.17 frames. ], batch size: 548, lr: 2.39e-03, grad_scale: 32.0 2023-06-28 21:39:59,826 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=2151846.0, ans=0.125 2023-06-28 21:39:59,886 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=2151846.0, ans=0.125 2023-06-28 21:40:16,710 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.58 vs. limit=22.5 2023-06-28 21:40:56,172 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=2152026.0, ans=0.125 2023-06-28 21:41:38,929 INFO [train.py:996] (3/4) Epoch 12, batch 23250, loss[loss=0.2229, simple_loss=0.29, pruned_loss=0.0779, over 21911.00 frames. ], tot_loss[loss=0.2098, simple_loss=0.2846, pruned_loss=0.06751, over 4273453.21 frames. 
], batch size: 371, lr: 2.39e-03, grad_scale: 32.0 2023-06-28 21:41:52,728 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=2152146.0, ans=0.0 2023-06-28 21:41:58,600 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.316e+02 9.370e+02 1.450e+03 2.114e+03 3.490e+03, threshold=2.900e+03, percent-clipped=30.0 2023-06-28 21:42:07,470 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=2152206.0, ans=0.0 2023-06-28 21:42:27,621 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=2152266.0, ans=0.2 2023-06-28 21:43:22,221 INFO [train.py:996] (3/4) Epoch 12, batch 23300, loss[loss=0.2226, simple_loss=0.3258, pruned_loss=0.05971, over 21277.00 frames. ], tot_loss[loss=0.2155, simple_loss=0.2923, pruned_loss=0.06933, over 4282257.61 frames. ], batch size: 548, lr: 2.39e-03, grad_scale: 32.0 2023-06-28 21:43:51,495 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.20 vs. limit=6.0 2023-06-28 21:43:55,526 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=2152506.0, ans=0.2 2023-06-28 21:43:57,507 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2152506.0, ans=0.1 2023-06-28 21:44:03,976 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=2152566.0, ans=0.125 2023-06-28 21:44:55,439 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=2152686.0, ans=0.125 2023-06-28 21:45:02,457 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-28 21:45:09,870 INFO [train.py:996] (3/4) Epoch 12, batch 23350, loss[loss=0.1663, simple_loss=0.2565, pruned_loss=0.03805, over 21732.00 frames. ], tot_loss[loss=0.2161, simple_loss=0.2963, pruned_loss=0.06797, over 4276862.18 frames. ], batch size: 332, lr: 2.39e-03, grad_scale: 8.0 2023-06-28 21:45:32,703 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.36 vs. limit=15.0 2023-06-28 21:45:33,277 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.157e+02 1.010e+03 1.481e+03 2.093e+03 4.806e+03, threshold=2.962e+03, percent-clipped=5.0 2023-06-28 21:45:42,611 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=10.88 vs. limit=22.5 2023-06-28 21:45:43,508 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=2152806.0, ans=0.0 2023-06-28 21:46:34,107 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2152986.0, ans=0.1 2023-06-28 21:46:48,695 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=2152986.0, ans=0.125 2023-06-28 21:46:51,348 INFO [train.py:996] (3/4) Epoch 12, batch 23400, loss[loss=0.1652, simple_loss=0.2677, pruned_loss=0.03134, over 20765.00 frames. 
], tot_loss[loss=0.2099, simple_loss=0.29, pruned_loss=0.06485, over 4280057.94 frames. ], batch size: 607, lr: 2.39e-03, grad_scale: 8.0 2023-06-28 21:47:33,090 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=2153166.0, ans=0.2 2023-06-28 21:47:34,619 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=2153166.0, ans=0.0 2023-06-28 21:47:40,094 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.76 vs. limit=12.0 2023-06-28 21:48:38,236 INFO [train.py:996] (3/4) Epoch 12, batch 23450, loss[loss=0.2295, simple_loss=0.2968, pruned_loss=0.08113, over 21381.00 frames. ], tot_loss[loss=0.2127, simple_loss=0.2908, pruned_loss=0.06726, over 4287692.33 frames. ], batch size: 548, lr: 2.39e-03, grad_scale: 8.0 2023-06-28 21:48:56,421 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.004e+02 7.180e+02 1.083e+03 1.740e+03 4.594e+03, threshold=2.165e+03, percent-clipped=4.0 2023-06-28 21:49:06,172 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.37 vs. limit=15.0 2023-06-28 21:49:23,977 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=6.53 vs. limit=15.0 2023-06-28 21:49:45,585 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.73 vs. limit=15.0 2023-06-28 21:49:46,521 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=2153526.0, ans=0.125 2023-06-28 21:49:49,694 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=2153526.0, ans=0.125 2023-06-28 21:50:06,251 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.19 vs. limit=15.0 2023-06-28 21:50:19,186 INFO [train.py:996] (3/4) Epoch 12, batch 23500, loss[loss=0.2051, simple_loss=0.2793, pruned_loss=0.06546, over 21835.00 frames. ], tot_loss[loss=0.2132, simple_loss=0.2897, pruned_loss=0.06835, over 4292743.34 frames. ], batch size: 298, lr: 2.39e-03, grad_scale: 8.0 2023-06-28 21:50:19,799 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=2153646.0, ans=0.0 2023-06-28 21:50:23,294 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.95 vs. limit=15.0 2023-06-28 21:50:37,465 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2153706.0, ans=0.1 2023-06-28 21:50:38,194 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.31 vs. limit=6.0 2023-06-28 21:51:46,717 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=2153886.0, ans=0.125 2023-06-28 21:51:56,097 INFO [train.py:996] (3/4) Epoch 12, batch 23550, loss[loss=0.2069, simple_loss=0.2746, pruned_loss=0.06958, over 21780.00 frames. 
], tot_loss[loss=0.2102, simple_loss=0.2844, pruned_loss=0.06796, over 4295115.75 frames. ], batch size: 112, lr: 2.39e-03, grad_scale: 8.0 2023-06-28 21:52:06,208 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2153946.0, ans=0.1 2023-06-28 21:52:18,946 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.122e+02 7.386e+02 1.223e+03 1.985e+03 5.110e+03, threshold=2.446e+03, percent-clipped=21.0 2023-06-28 21:52:21,741 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=8.12 vs. limit=12.0 2023-06-28 21:53:02,929 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=2154126.0, ans=0.125 2023-06-28 21:53:28,118 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.85 vs. limit=15.0 2023-06-28 21:53:40,571 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=2154186.0, ans=0.2 2023-06-28 21:53:43,333 INFO [train.py:996] (3/4) Epoch 12, batch 23600, loss[loss=0.2208, simple_loss=0.2928, pruned_loss=0.07436, over 21274.00 frames. ], tot_loss[loss=0.2106, simple_loss=0.2856, pruned_loss=0.06784, over 4288385.17 frames. ], batch size: 548, lr: 2.39e-03, grad_scale: 16.0 2023-06-28 21:53:45,600 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=2154246.0, ans=0.015 2023-06-28 21:53:52,297 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=2154246.0, ans=0.125 2023-06-28 21:55:26,536 INFO [train.py:996] (3/4) Epoch 12, batch 23650, loss[loss=0.2242, simple_loss=0.3033, pruned_loss=0.07256, over 21459.00 frames. ], tot_loss[loss=0.2096, simple_loss=0.2863, pruned_loss=0.06648, over 4287407.28 frames. ], batch size: 194, lr: 2.39e-03, grad_scale: 16.0 2023-06-28 21:55:50,245 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.631e+02 9.498e+02 1.627e+03 2.545e+03 5.743e+03, threshold=3.254e+03, percent-clipped=28.0 2023-06-28 21:56:27,290 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=2154666.0, ans=0.125 2023-06-28 21:56:39,357 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.52 vs. limit=15.0 2023-06-28 21:56:40,263 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=2154726.0, ans=0.125 2023-06-28 21:56:46,071 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.29 vs. limit=15.0 2023-06-28 21:56:57,366 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=2154786.0, ans=0.05 2023-06-28 21:57:10,335 INFO [train.py:996] (3/4) Epoch 12, batch 23700, loss[loss=0.2175, simple_loss=0.3081, pruned_loss=0.06346, over 19876.00 frames. ], tot_loss[loss=0.2103, simple_loss=0.2884, pruned_loss=0.0661, over 4284904.69 frames. 
], batch size: 704, lr: 2.39e-03, grad_scale: 16.0 2023-06-28 21:58:58,951 INFO [train.py:996] (3/4) Epoch 12, batch 23750, loss[loss=0.2065, simple_loss=0.3054, pruned_loss=0.05378, over 20662.00 frames. ], tot_loss[loss=0.2126, simple_loss=0.2916, pruned_loss=0.06678, over 4282788.44 frames. ], batch size: 607, lr: 2.39e-03, grad_scale: 16.0 2023-06-28 21:59:20,614 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=2155206.0, ans=0.1 2023-06-28 21:59:21,758 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.921e+02 7.434e+02 9.463e+02 1.338e+03 4.159e+03, threshold=1.893e+03, percent-clipped=3.0 2023-06-28 21:59:47,070 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=2155266.0, ans=0.0 2023-06-28 22:00:07,001 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2155326.0, ans=0.1 2023-06-28 22:00:19,637 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2155386.0, ans=0.1 2023-06-28 22:00:26,832 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=2155386.0, ans=0.125 2023-06-28 22:00:47,741 INFO [train.py:996] (3/4) Epoch 12, batch 23800, loss[loss=0.2576, simple_loss=0.3563, pruned_loss=0.07941, over 21616.00 frames. ], tot_loss[loss=0.2091, simple_loss=0.2887, pruned_loss=0.06474, over 4282200.83 frames. ], batch size: 389, lr: 2.39e-03, grad_scale: 8.0 2023-06-28 22:01:27,276 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=2155506.0, ans=0.125 2023-06-28 22:01:28,513 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=2155506.0, ans=0.125 2023-06-28 22:01:44,144 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=2155566.0, ans=0.0 2023-06-28 22:02:36,563 INFO [train.py:996] (3/4) Epoch 12, batch 23850, loss[loss=0.256, simple_loss=0.3685, pruned_loss=0.07172, over 19788.00 frames. ], tot_loss[loss=0.2153, simple_loss=0.298, pruned_loss=0.06634, over 4280439.48 frames. 
], batch size: 702, lr: 2.39e-03, grad_scale: 8.0 2023-06-28 22:03:01,487 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.378e+02 9.558e+02 1.642e+03 2.659e+03 5.260e+03, threshold=3.284e+03, percent-clipped=38.0 2023-06-28 22:03:02,130 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=2155806.0, ans=0.125 2023-06-28 22:03:05,401 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=2155806.0, ans=0.125 2023-06-28 22:03:08,672 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=2155806.0, ans=0.125 2023-06-28 22:03:15,487 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=2155866.0, ans=0.2 2023-06-28 22:03:53,458 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=2155926.0, ans=0.125 2023-06-28 22:03:55,175 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=2155926.0, ans=0.0 2023-06-28 22:04:19,073 INFO [train.py:996] (3/4) Epoch 12, batch 23900, loss[loss=0.2124, simple_loss=0.289, pruned_loss=0.06789, over 20738.00 frames. ], tot_loss[loss=0.2207, simple_loss=0.3052, pruned_loss=0.06811, over 4279963.20 frames. ], batch size: 607, lr: 2.39e-03, grad_scale: 8.0 2023-06-28 22:05:23,624 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.whiten.whitening_limit, batch_count=2156226.0, ans=12.0 2023-06-28 22:05:47,222 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.36 vs. limit=15.0 2023-06-28 22:05:53,068 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=2156286.0, ans=0.125 2023-06-28 22:06:00,802 INFO [train.py:996] (3/4) Epoch 12, batch 23950, loss[loss=0.2244, simple_loss=0.2976, pruned_loss=0.07563, over 21486.00 frames. ], tot_loss[loss=0.218, simple_loss=0.299, pruned_loss=0.06851, over 4269175.04 frames. ], batch size: 389, lr: 2.39e-03, grad_scale: 8.0 2023-06-28 22:06:25,836 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.975e+02 6.890e+02 9.042e+02 1.238e+03 2.308e+03, threshold=1.808e+03, percent-clipped=0.0 2023-06-28 22:07:34,978 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.32 vs. limit=12.0 2023-06-28 22:07:48,524 INFO [train.py:996] (3/4) Epoch 12, batch 24000, loss[loss=0.2462, simple_loss=0.3223, pruned_loss=0.08505, over 21677.00 frames. ], tot_loss[loss=0.2213, simple_loss=0.3002, pruned_loss=0.07117, over 4270445.41 frames. ], batch size: 391, lr: 2.39e-03, grad_scale: 16.0 2023-06-28 22:07:48,525 INFO [train.py:1019] (3/4) Computing validation loss 2023-06-28 22:08:05,132 INFO [train.py:1028] (3/4) Epoch 12, validation: loss=0.264, simple_loss=0.3553, pruned_loss=0.08634, over 1796401.00 frames. 
2023-06-28 22:08:05,133 INFO [train.py:1029] (3/4) Maximum memory allocated so far is 23690MB 2023-06-28 22:08:34,066 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=2156706.0, ans=0.0 2023-06-28 22:09:04,077 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=2156766.0, ans=0.125 2023-06-28 22:09:49,058 INFO [train.py:996] (3/4) Epoch 12, batch 24050, loss[loss=0.191, simple_loss=0.2862, pruned_loss=0.04789, over 21754.00 frames. ], tot_loss[loss=0.2214, simple_loss=0.3008, pruned_loss=0.07097, over 4268067.79 frames. ], batch size: 332, lr: 2.39e-03, grad_scale: 16.0 2023-06-28 22:09:55,293 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.61 vs. limit=15.0 2023-06-28 22:10:14,177 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.391e+02 8.286e+02 1.353e+03 2.052e+03 4.335e+03, threshold=2.707e+03, percent-clipped=33.0 2023-06-28 22:10:39,222 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=2157066.0, ans=0.0 2023-06-28 22:10:39,289 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=2157066.0, ans=0.09899494936611666 2023-06-28 22:10:46,159 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=2157066.0, ans=0.125 2023-06-28 22:10:53,212 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.27 vs. limit=15.0 2023-06-28 22:10:54,256 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2157126.0, ans=0.1 2023-06-28 22:11:20,574 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=2157186.0, ans=0.125 2023-06-28 22:11:25,384 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=2157186.0, ans=0.04949747468305833 2023-06-28 22:11:31,511 INFO [train.py:996] (3/4) Epoch 12, batch 24100, loss[loss=0.2333, simple_loss=0.3182, pruned_loss=0.07425, over 21657.00 frames. ], tot_loss[loss=0.2213, simple_loss=0.3026, pruned_loss=0.06999, over 4271640.92 frames. ], batch size: 263, lr: 2.39e-03, grad_scale: 16.0 2023-06-28 22:11:32,158 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2157246.0, ans=0.1 2023-06-28 22:11:32,217 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=2157246.0, ans=0.0 2023-06-28 22:11:46,771 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=2157306.0, ans=0.125 2023-06-28 22:12:37,980 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=2157426.0, ans=0.125 2023-06-28 22:12:45,143 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.76 vs. 
limit=10.0 2023-06-28 22:12:46,613 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=12.98 vs. limit=22.5 2023-06-28 22:13:09,451 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.50 vs. limit=22.5 2023-06-28 22:13:10,512 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=2157486.0, ans=0.125 2023-06-28 22:13:13,305 INFO [train.py:996] (3/4) Epoch 12, batch 24150, loss[loss=0.2293, simple_loss=0.3113, pruned_loss=0.07359, over 21889.00 frames. ], tot_loss[loss=0.2237, simple_loss=0.3033, pruned_loss=0.07209, over 4279584.49 frames. ], batch size: 107, lr: 2.39e-03, grad_scale: 16.0 2023-06-28 22:13:25,431 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=2157546.0, ans=0.125 2023-06-28 22:13:43,013 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.450e+02 8.470e+02 1.133e+03 1.588e+03 3.416e+03, threshold=2.267e+03, percent-clipped=5.0 2023-06-28 22:14:35,883 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.95 vs. limit=15.0 2023-06-28 22:14:41,790 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=2157786.0, ans=0.2 2023-06-28 22:14:56,845 INFO [train.py:996] (3/4) Epoch 12, batch 24200, loss[loss=0.2335, simple_loss=0.3187, pruned_loss=0.07416, over 21785.00 frames. ], tot_loss[loss=0.2259, simple_loss=0.3049, pruned_loss=0.0735, over 4277502.13 frames. ], batch size: 316, lr: 2.39e-03, grad_scale: 16.0 2023-06-28 22:15:14,817 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=2157846.0, ans=0.1 2023-06-28 22:15:36,538 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=2157906.0, ans=0.125 2023-06-28 22:15:38,611 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.33 vs. limit=6.0 2023-06-28 22:16:47,970 INFO [train.py:996] (3/4) Epoch 12, batch 24250, loss[loss=0.1809, simple_loss=0.287, pruned_loss=0.03735, over 21659.00 frames. ], tot_loss[loss=0.2184, simple_loss=0.3018, pruned_loss=0.06752, over 4277804.38 frames. ], batch size: 441, lr: 2.39e-03, grad_scale: 16.0 2023-06-28 22:17:02,483 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=2158146.0, ans=0.0 2023-06-28 22:17:17,819 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.733e+02 8.184e+02 1.120e+03 1.541e+03 3.593e+03, threshold=2.240e+03, percent-clipped=10.0 2023-06-28 22:17:20,080 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=2158206.0, ans=0.2 2023-06-28 22:17:40,869 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.01 vs. limit=15.0 2023-06-28 22:18:31,292 INFO [train.py:996] (3/4) Epoch 12, batch 24300, loss[loss=0.1704, simple_loss=0.2534, pruned_loss=0.04368, over 21782.00 frames. ], tot_loss[loss=0.2109, simple_loss=0.2961, pruned_loss=0.0628, over 4276963.64 frames. 
], batch size: 282, lr: 2.39e-03, grad_scale: 16.0 2023-06-28 22:18:41,520 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=2158446.0, ans=0.2 2023-06-28 22:19:04,841 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten.whitening_limit, batch_count=2158506.0, ans=15.0 2023-06-28 22:19:18,094 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=2158566.0, ans=0.0 2023-06-28 22:20:13,768 INFO [train.py:996] (3/4) Epoch 12, batch 24350, loss[loss=0.2591, simple_loss=0.3381, pruned_loss=0.09008, over 21795.00 frames. ], tot_loss[loss=0.2099, simple_loss=0.2934, pruned_loss=0.06324, over 4284046.97 frames. ], batch size: 124, lr: 2.39e-03, grad_scale: 16.0 2023-06-28 22:20:38,905 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.237e+02 7.403e+02 1.076e+03 1.597e+03 3.002e+03, threshold=2.153e+03, percent-clipped=3.0 2023-06-28 22:21:43,225 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.77 vs. limit=15.0 2023-06-28 22:21:52,105 INFO [train.py:996] (3/4) Epoch 12, batch 24400, loss[loss=0.2082, simple_loss=0.2748, pruned_loss=0.07076, over 20072.00 frames. ], tot_loss[loss=0.2147, simple_loss=0.2968, pruned_loss=0.06626, over 4284263.57 frames. ], batch size: 702, lr: 2.39e-03, grad_scale: 32.0 2023-06-28 22:22:03,846 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer_na.min_abs, batch_count=2159046.0, ans=0.02 2023-06-28 22:22:37,867 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=2159166.0, ans=0.0 2023-06-28 22:22:39,865 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=2159166.0, ans=0.125 2023-06-28 22:23:15,537 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=2159226.0, ans=0.0 2023-06-28 22:23:39,925 INFO [train.py:996] (3/4) Epoch 12, batch 24450, loss[loss=0.204, simple_loss=0.3, pruned_loss=0.05404, over 21622.00 frames. ], tot_loss[loss=0.2164, simple_loss=0.2977, pruned_loss=0.06748, over 4278017.62 frames. ], batch size: 263, lr: 2.39e-03, grad_scale: 16.0 2023-06-28 22:24:01,302 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.221e+02 9.735e+02 1.433e+03 2.433e+03 5.313e+03, threshold=2.865e+03, percent-clipped=29.0 2023-06-28 22:24:36,007 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=8.88 vs. limit=15.0 2023-06-28 22:24:58,146 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2159526.0, ans=0.1 2023-06-28 22:25:00,146 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=2159586.0, ans=0.125 2023-06-28 22:25:22,563 INFO [train.py:996] (3/4) Epoch 12, batch 24500, loss[loss=0.2156, simple_loss=0.2923, pruned_loss=0.06943, over 21939.00 frames. ], tot_loss[loss=0.2164, simple_loss=0.2977, pruned_loss=0.06751, over 4280342.86 frames. 
], batch size: 118, lr: 2.39e-03, grad_scale: 16.0 2023-06-28 22:25:28,052 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2159646.0, ans=0.1 2023-06-28 22:25:50,029 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=2159706.0, ans=0.125 2023-06-28 22:26:19,578 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=2159766.0, ans=10.0 2023-06-28 22:26:35,897 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=2159826.0, ans=0.0 2023-06-28 22:27:04,765 INFO [train.py:996] (3/4) Epoch 12, batch 24550, loss[loss=0.2236, simple_loss=0.3034, pruned_loss=0.07196, over 21419.00 frames. ], tot_loss[loss=0.2185, simple_loss=0.2993, pruned_loss=0.06888, over 4280177.68 frames. ], batch size: 159, lr: 2.39e-03, grad_scale: 16.0 2023-06-28 22:27:22,211 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=2160006.0, ans=0.125 2023-06-28 22:27:28,296 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.692e+02 8.632e+02 1.069e+03 1.677e+03 3.577e+03, threshold=2.139e+03, percent-clipped=6.0 2023-06-28 22:27:29,035 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=2160006.0, ans=0.125 2023-06-28 22:27:32,865 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.44 vs. limit=10.0 2023-06-28 22:28:07,972 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=2160066.0, ans=0.5 2023-06-28 22:28:11,484 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=2160126.0, ans=0.0 2023-06-28 22:28:21,215 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=2160126.0, ans=0.09899494936611666 2023-06-28 22:28:22,736 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=2160126.0, ans=10.0 2023-06-28 22:28:48,562 INFO [train.py:996] (3/4) Epoch 12, batch 24600, loss[loss=0.2048, simple_loss=0.2736, pruned_loss=0.068, over 21803.00 frames. ], tot_loss[loss=0.2162, simple_loss=0.2949, pruned_loss=0.0687, over 4279149.96 frames. ], batch size: 352, lr: 2.39e-03, grad_scale: 16.0 2023-06-28 22:28:49,846 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.05 vs. limit=10.0 2023-06-28 22:29:22,877 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=2160306.0, ans=0.2 2023-06-28 22:29:37,341 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer_ff2.min_abs, batch_count=2160366.0, ans=0.1 2023-06-28 22:30:28,219 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.74 vs. limit=10.0 2023-06-28 22:30:32,027 INFO [train.py:996] (3/4) Epoch 12, batch 24650, loss[loss=0.1743, simple_loss=0.2355, pruned_loss=0.05657, over 21412.00 frames. 
], tot_loss[loss=0.2105, simple_loss=0.2876, pruned_loss=0.06668, over 4276644.16 frames. ], batch size: 212, lr: 2.39e-03, grad_scale: 16.0 2023-06-28 22:30:53,461 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.507e+02 9.210e+02 1.420e+03 2.040e+03 4.110e+03, threshold=2.841e+03, percent-clipped=23.0 2023-06-28 22:31:27,323 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.13 vs. limit=15.0 2023-06-28 22:31:43,191 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=2160726.0, ans=0.125 2023-06-28 22:31:46,556 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=2160726.0, ans=0.125 2023-06-28 22:31:47,117 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.08 vs. limit=22.5 2023-06-28 22:31:49,639 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer_na.min_abs, batch_count=2160726.0, ans=0.02 2023-06-28 22:32:13,587 INFO [train.py:996] (3/4) Epoch 12, batch 24700, loss[loss=0.1773, simple_loss=0.2657, pruned_loss=0.04442, over 21451.00 frames. ], tot_loss[loss=0.2077, simple_loss=0.2851, pruned_loss=0.06512, over 4276608.14 frames. ], batch size: 194, lr: 2.39e-03, grad_scale: 16.0 2023-06-28 22:32:21,987 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=2160846.0, ans=0.0 2023-06-28 22:32:50,275 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=2160966.0, ans=0.2 2023-06-28 22:32:58,519 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=2160966.0, ans=0.0 2023-06-28 22:33:54,737 INFO [train.py:996] (3/4) Epoch 12, batch 24750, loss[loss=0.2149, simple_loss=0.2822, pruned_loss=0.07386, over 21836.00 frames. ], tot_loss[loss=0.203, simple_loss=0.2794, pruned_loss=0.06329, over 4268068.25 frames. ], batch size: 107, lr: 2.39e-03, grad_scale: 16.0 2023-06-28 22:34:11,587 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=2161206.0, ans=0.125 2023-06-28 22:34:15,908 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.049e+02 6.504e+02 9.325e+02 1.249e+03 2.794e+03, threshold=1.865e+03, percent-clipped=0.0 2023-06-28 22:34:16,645 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=2161206.0, ans=0.125 2023-06-28 22:34:18,103 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=2161206.0, ans=0.125 2023-06-28 22:34:19,857 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=2161206.0, ans=0.2 2023-06-28 22:35:35,292 INFO [train.py:996] (3/4) Epoch 12, batch 24800, loss[loss=0.2231, simple_loss=0.2907, pruned_loss=0.07778, over 21823.00 frames. ], tot_loss[loss=0.2008, simple_loss=0.2748, pruned_loss=0.06339, over 4272660.40 frames. ], batch size: 391, lr: 2.39e-03, grad_scale: 32.0 2023-06-28 22:36:18,662 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.74 vs. 
limit=15.0 2023-06-28 22:36:30,041 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=2161566.0, ans=0.125 2023-06-28 22:36:34,134 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=5.39 vs. limit=15.0 2023-06-28 22:36:57,954 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=2161626.0, ans=0.0 2023-06-28 22:37:12,979 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2161686.0, ans=0.1 2023-06-28 22:37:19,244 INFO [train.py:996] (3/4) Epoch 12, batch 24850, loss[loss=0.2246, simple_loss=0.3066, pruned_loss=0.07124, over 21692.00 frames. ], tot_loss[loss=0.202, simple_loss=0.2749, pruned_loss=0.06455, over 4280496.86 frames. ], batch size: 389, lr: 2.39e-03, grad_scale: 16.0 2023-06-28 22:37:19,780 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2161746.0, ans=0.1 2023-06-28 22:37:23,368 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=2161746.0, ans=0.0 2023-06-28 22:37:42,809 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.956e+02 8.268e+02 1.225e+03 1.737e+03 3.601e+03, threshold=2.449e+03, percent-clipped=20.0 2023-06-28 22:37:46,931 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=2161806.0, ans=0.125 2023-06-28 22:38:14,963 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=2161866.0, ans=0.0 2023-06-28 22:38:32,484 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=2161926.0, ans=0.05 2023-06-28 22:39:01,672 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.33 vs. limit=12.0 2023-06-28 22:39:01,936 INFO [train.py:996] (3/4) Epoch 12, batch 24900, loss[loss=0.2189, simple_loss=0.2993, pruned_loss=0.0692, over 21459.00 frames. ], tot_loss[loss=0.2052, simple_loss=0.2789, pruned_loss=0.0658, over 4281766.17 frames. ], batch size: 211, lr: 2.39e-03, grad_scale: 16.0 2023-06-28 22:40:12,747 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.min_positive, batch_count=2162226.0, ans=0.05 2023-06-28 22:40:17,948 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=2162226.0, ans=0.125 2023-06-28 22:40:36,853 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=2162286.0, ans=0.125 2023-06-28 22:40:46,423 INFO [train.py:996] (3/4) Epoch 12, batch 24950, loss[loss=0.2246, simple_loss=0.3025, pruned_loss=0.07336, over 21609.00 frames. ], tot_loss[loss=0.2119, simple_loss=0.2857, pruned_loss=0.06901, over 4278668.07 frames. 
], batch size: 263, lr: 2.39e-03, grad_scale: 16.0 2023-06-28 22:41:20,554 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.276e+02 8.663e+02 1.354e+03 1.983e+03 3.739e+03, threshold=2.709e+03, percent-clipped=10.0 2023-06-28 22:41:30,316 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=8.70 vs. limit=15.0 2023-06-28 22:41:31,558 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=2162406.0, ans=0.0 2023-06-28 22:41:55,192 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=2162526.0, ans=0.125 2023-06-28 22:42:07,093 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=2162526.0, ans=0.0 2023-06-28 22:42:23,823 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=2162586.0, ans=0.125 2023-06-28 22:42:31,501 INFO [train.py:996] (3/4) Epoch 12, batch 25000, loss[loss=0.1729, simple_loss=0.2232, pruned_loss=0.06127, over 20300.00 frames. ], tot_loss[loss=0.2161, simple_loss=0.2906, pruned_loss=0.07084, over 4278124.25 frames. ], batch size: 703, lr: 2.39e-03, grad_scale: 16.0 2023-06-28 22:42:33,847 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=2162646.0, ans=0.2 2023-06-28 22:42:38,882 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=2162646.0, ans=0.04949747468305833 2023-06-28 22:42:58,584 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2162706.0, ans=0.1 2023-06-28 22:43:09,324 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=2162706.0, ans=0.125 2023-06-28 22:44:12,539 INFO [train.py:996] (3/4) Epoch 12, batch 25050, loss[loss=0.1662, simple_loss=0.2226, pruned_loss=0.05489, over 20656.00 frames. ], tot_loss[loss=0.2113, simple_loss=0.2839, pruned_loss=0.06933, over 4275462.89 frames. ], batch size: 607, lr: 2.39e-03, grad_scale: 16.0 2023-06-28 22:44:30,703 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=2162946.0, ans=0.125 2023-06-28 22:44:49,733 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.916e+02 6.443e+02 9.220e+02 1.309e+03 4.556e+03, threshold=1.844e+03, percent-clipped=4.0 2023-06-28 22:45:03,910 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=8.95 vs. limit=22.5 2023-06-28 22:45:45,776 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=7.03 vs. limit=15.0 2023-06-28 22:45:54,124 INFO [train.py:996] (3/4) Epoch 12, batch 25100, loss[loss=0.1853, simple_loss=0.265, pruned_loss=0.05278, over 21681.00 frames. ], tot_loss[loss=0.2075, simple_loss=0.2786, pruned_loss=0.06822, over 4275088.09 frames. 
], batch size: 282, lr: 2.39e-03, grad_scale: 16.0 2023-06-28 22:47:03,127 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=2163426.0, ans=0.2 2023-06-28 22:47:30,132 INFO [train.py:996] (3/4) Epoch 12, batch 25150, loss[loss=0.2203, simple_loss=0.2951, pruned_loss=0.07276, over 21621.00 frames. ], tot_loss[loss=0.2079, simple_loss=0.2827, pruned_loss=0.06653, over 4258884.32 frames. ], batch size: 471, lr: 2.39e-03, grad_scale: 16.0 2023-06-28 22:47:39,087 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=2163546.0, ans=0.125 2023-06-28 22:47:40,749 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer_ff2.min_abs, batch_count=2163546.0, ans=0.1 2023-06-28 22:48:07,815 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.073e+02 7.241e+02 9.101e+02 1.469e+03 3.331e+03, threshold=1.820e+03, percent-clipped=11.0 2023-06-28 22:48:18,838 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.23 vs. limit=15.0 2023-06-28 22:48:26,811 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=2163666.0, ans=0.2 2023-06-28 22:48:28,391 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=2163666.0, ans=0.04949747468305833 2023-06-28 22:48:53,351 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=2163726.0, ans=0.125 2023-06-28 22:49:12,488 INFO [train.py:996] (3/4) Epoch 12, batch 25200, loss[loss=0.1772, simple_loss=0.2628, pruned_loss=0.04582, over 21455.00 frames. ], tot_loss[loss=0.2062, simple_loss=0.2829, pruned_loss=0.06475, over 4260965.50 frames. ], batch size: 131, lr: 2.39e-03, grad_scale: 32.0 2023-06-28 22:49:30,936 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=2163846.0, ans=0.0 2023-06-28 22:50:30,278 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=2164026.0, ans=0.125 2023-06-28 22:50:41,611 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=2164086.0, ans=0.0 2023-06-28 22:50:43,641 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=2164086.0, ans=0.125 2023-06-28 22:50:54,692 INFO [train.py:996] (3/4) Epoch 12, batch 25250, loss[loss=0.2098, simple_loss=0.2609, pruned_loss=0.07935, over 21326.00 frames. ], tot_loss[loss=0.2047, simple_loss=0.2825, pruned_loss=0.06351, over 4257995.51 frames. ], batch size: 508, lr: 2.38e-03, grad_scale: 32.0 2023-06-28 22:51:06,372 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=2164146.0, ans=0.0 2023-06-28 22:51:33,358 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.757e+02 8.180e+02 1.142e+03 1.720e+03 2.915e+03, threshold=2.285e+03, percent-clipped=21.0 2023-06-28 22:51:36,216 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.51 vs. 
limit=15.0 2023-06-28 22:52:36,300 INFO [train.py:996] (3/4) Epoch 12, batch 25300, loss[loss=0.2361, simple_loss=0.3089, pruned_loss=0.08161, over 21314.00 frames. ], tot_loss[loss=0.2026, simple_loss=0.2795, pruned_loss=0.06281, over 4253316.94 frames. ], batch size: 548, lr: 2.38e-03, grad_scale: 16.0 2023-06-28 22:52:36,852 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=2164446.0, ans=0.125 2023-06-28 22:52:40,019 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=2164446.0, ans=0.0 2023-06-28 22:52:49,604 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=2164446.0, ans=0.07 2023-06-28 22:53:07,588 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=2164506.0, ans=0.125 2023-06-28 22:53:19,330 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=2164506.0, ans=10.0 2023-06-28 22:53:36,885 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=2164566.0, ans=0.0 2023-06-28 22:53:56,437 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=2164626.0, ans=0.125 2023-06-28 22:53:56,996 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=2.67 vs. limit=15.0 2023-06-28 22:54:22,188 INFO [train.py:996] (3/4) Epoch 12, batch 25350, loss[loss=0.1955, simple_loss=0.2792, pruned_loss=0.05596, over 21600.00 frames. ], tot_loss[loss=0.2035, simple_loss=0.282, pruned_loss=0.06257, over 4255896.36 frames. ], batch size: 414, lr: 2.38e-03, grad_scale: 16.0 2023-06-28 22:54:45,366 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2164806.0, ans=0.1 2023-06-28 22:54:55,964 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.352e+02 8.129e+02 1.301e+03 1.964e+03 4.138e+03, threshold=2.601e+03, percent-clipped=21.0 2023-06-28 22:54:58,085 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=2164806.0, ans=0.125 2023-06-28 22:55:32,687 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=9.31 vs. limit=15.0 2023-06-28 22:55:33,794 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2164926.0, ans=0.1 2023-06-28 22:55:41,805 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=2164986.0, ans=0.125 2023-06-28 22:55:55,064 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=15.72 vs. limit=15.0 2023-06-28 22:55:55,080 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.46 vs. limit=6.0 2023-06-28 22:55:57,308 INFO [train.py:996] (3/4) Epoch 12, batch 25400, loss[loss=0.2285, simple_loss=0.2791, pruned_loss=0.08896, over 21242.00 frames. 
], tot_loss[loss=0.2005, simple_loss=0.2773, pruned_loss=0.06182, over 4256493.30 frames. ], batch size: 471, lr: 2.38e-03, grad_scale: 16.0 2023-06-28 22:56:37,982 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=2165106.0, ans=0.125 2023-06-28 22:57:00,732 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=2165166.0, ans=0.0 2023-06-28 22:57:15,132 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=2165226.0, ans=0.0 2023-06-28 22:57:33,157 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=2165286.0, ans=0.0 2023-06-28 22:57:37,575 INFO [train.py:996] (3/4) Epoch 12, batch 25450, loss[loss=0.1925, simple_loss=0.2706, pruned_loss=0.05716, over 15470.00 frames. ], tot_loss[loss=0.2022, simple_loss=0.2778, pruned_loss=0.06326, over 4254054.32 frames. ], batch size: 62, lr: 2.38e-03, grad_scale: 16.0 2023-06-28 22:58:11,894 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.754e+02 9.465e+02 1.393e+03 2.029e+03 3.933e+03, threshold=2.786e+03, percent-clipped=12.0 2023-06-28 22:58:33,844 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=2165466.0, ans=0.0 2023-06-28 22:59:11,046 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=2165586.0, ans=0.0 2023-06-28 22:59:20,044 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=6.13 vs. limit=10.0 2023-06-28 22:59:21,771 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=13.68 vs. limit=22.5 2023-06-28 22:59:25,649 INFO [train.py:996] (3/4) Epoch 12, batch 25500, loss[loss=0.1243, simple_loss=0.1947, pruned_loss=0.02697, over 16859.00 frames. ], tot_loss[loss=0.1985, simple_loss=0.2769, pruned_loss=0.06008, over 4249165.84 frames. ], batch size: 60, lr: 2.38e-03, grad_scale: 16.0 2023-06-28 22:59:42,453 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=2165646.0, ans=0.125 2023-06-28 22:59:59,196 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=6.26 vs. limit=15.0 2023-06-28 23:00:12,034 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=2165766.0, ans=0.125 2023-06-28 23:00:20,280 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=2165766.0, ans=0.125 2023-06-28 23:00:26,557 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=2165766.0, ans=0.0 2023-06-28 23:00:29,942 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=2165826.0, ans=0.125 2023-06-28 23:01:08,424 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=10.22 vs. 
limit=22.5 2023-06-28 23:01:11,414 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.55 vs. limit=15.0 2023-06-28 23:01:12,020 INFO [train.py:996] (3/4) Epoch 12, batch 25550, loss[loss=0.208, simple_loss=0.3245, pruned_loss=0.04575, over 20768.00 frames. ], tot_loss[loss=0.2022, simple_loss=0.2844, pruned_loss=0.06002, over 4254466.16 frames. ], batch size: 607, lr: 2.38e-03, grad_scale: 16.0 2023-06-28 23:01:23,566 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.57 vs. limit=10.0 2023-06-28 23:01:30,833 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=2165946.0, ans=0.0 2023-06-28 23:01:45,805 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=2166006.0, ans=0.125 2023-06-28 23:01:46,810 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.170e+02 8.424e+02 1.256e+03 1.965e+03 3.448e+03, threshold=2.512e+03, percent-clipped=4.0 2023-06-28 23:02:39,698 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=2166186.0, ans=0.125 2023-06-28 23:02:41,363 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=2166186.0, ans=0.0 2023-06-28 23:02:51,077 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=2166186.0, ans=0.125 2023-06-28 23:02:52,882 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=2166246.0, ans=0.5 2023-06-28 23:02:58,673 INFO [train.py:996] (3/4) Epoch 12, batch 25600, loss[loss=0.1989, simple_loss=0.2687, pruned_loss=0.06452, over 19981.00 frames. ], tot_loss[loss=0.2055, simple_loss=0.2885, pruned_loss=0.06127, over 4253464.48 frames. ], batch size: 702, lr: 2.38e-03, grad_scale: 32.0 2023-06-28 23:03:17,235 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2166246.0, ans=0.1 2023-06-28 23:03:38,776 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=2166366.0, ans=0.125 2023-06-28 23:03:45,368 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2166366.0, ans=0.1 2023-06-28 23:03:57,174 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.74 vs. limit=15.0 2023-06-28 23:04:39,575 INFO [train.py:996] (3/4) Epoch 12, batch 25650, loss[loss=0.2022, simple_loss=0.2759, pruned_loss=0.06427, over 21757.00 frames. ], tot_loss[loss=0.2067, simple_loss=0.2882, pruned_loss=0.06261, over 4249509.47 frames. 
], batch size: 124, lr: 2.38e-03, grad_scale: 32.0 2023-06-28 23:05:07,542 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=2166606.0, ans=0.125 2023-06-28 23:05:10,124 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.401e+02 8.404e+02 1.162e+03 1.787e+03 4.210e+03, threshold=2.325e+03, percent-clipped=7.0 2023-06-28 23:05:20,545 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=2166666.0, ans=0.0 2023-06-28 23:05:24,332 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.54 vs. limit=15.0 2023-06-28 23:05:35,500 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=2166726.0, ans=0.0 2023-06-28 23:06:05,689 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2166786.0, ans=0.1 2023-06-28 23:06:19,764 INFO [train.py:996] (3/4) Epoch 12, batch 25700, loss[loss=0.2101, simple_loss=0.2785, pruned_loss=0.07082, over 21810.00 frames. ], tot_loss[loss=0.2067, simple_loss=0.2855, pruned_loss=0.06389, over 4259752.25 frames. ], batch size: 351, lr: 2.38e-03, grad_scale: 16.0 2023-06-28 23:06:29,086 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=2166846.0, ans=0.125 2023-06-28 23:06:40,132 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=2166906.0, ans=0.125 2023-06-28 23:07:25,948 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=2167026.0, ans=0.0 2023-06-28 23:08:07,857 INFO [train.py:996] (3/4) Epoch 12, batch 25750, loss[loss=0.2618, simple_loss=0.3574, pruned_loss=0.08309, over 21856.00 frames. ], tot_loss[loss=0.2112, simple_loss=0.2902, pruned_loss=0.06612, over 4266265.25 frames. ], batch size: 371, lr: 2.38e-03, grad_scale: 16.0 2023-06-28 23:08:13,857 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=2167146.0, ans=0.125 2023-06-28 23:08:19,418 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=7.85 vs. 
limit=15.0 2023-06-28 23:08:20,458 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=2167146.0, ans=0.125 2023-06-28 23:08:29,132 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=2167206.0, ans=0.05 2023-06-28 23:08:39,821 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.094e+02 7.467e+02 1.127e+03 1.693e+03 5.779e+03, threshold=2.254e+03, percent-clipped=13.0 2023-06-28 23:08:40,472 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=2167206.0, ans=0.125 2023-06-28 23:08:42,096 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=2167206.0, ans=0.125 2023-06-28 23:08:46,222 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=9.91 vs. limit=15.0 2023-06-28 23:08:46,964 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=2167266.0, ans=0.125 2023-06-28 23:08:53,688 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=2167266.0, ans=0.125 2023-06-28 23:09:51,798 INFO [train.py:996] (3/4) Epoch 12, batch 25800, loss[loss=0.2568, simple_loss=0.3358, pruned_loss=0.08889, over 21510.00 frames. ], tot_loss[loss=0.2205, simple_loss=0.3013, pruned_loss=0.06984, over 4268680.60 frames. ], batch size: 194, lr: 2.38e-03, grad_scale: 16.0 2023-06-28 23:09:52,306 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=2167446.0, ans=0.125 2023-06-28 23:10:00,761 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=2167446.0, ans=0.125 2023-06-28 23:10:00,842 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=2167446.0, ans=0.2 2023-06-28 23:10:20,242 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=2167506.0, ans=0.125 2023-06-28 23:11:05,118 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=9.27 vs. limit=15.0 2023-06-28 23:11:31,279 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.42 vs. limit=15.0 2023-06-28 23:11:33,284 INFO [train.py:996] (3/4) Epoch 12, batch 25850, loss[loss=0.2066, simple_loss=0.284, pruned_loss=0.06459, over 21895.00 frames. ], tot_loss[loss=0.2215, simple_loss=0.3036, pruned_loss=0.06974, over 4272744.82 frames. ], batch size: 332, lr: 2.38e-03, grad_scale: 16.0 2023-06-28 23:12:08,744 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.172e+02 7.676e+02 1.093e+03 1.751e+03 3.507e+03, threshold=2.187e+03, percent-clipped=11.0 2023-06-28 23:13:23,851 INFO [train.py:996] (3/4) Epoch 12, batch 25900, loss[loss=0.2252, simple_loss=0.3138, pruned_loss=0.06828, over 21317.00 frames. ], tot_loss[loss=0.2221, simple_loss=0.3044, pruned_loss=0.0699, over 4279535.62 frames. 
], batch size: 159, lr: 2.38e-03, grad_scale: 16.0 2023-06-28 23:13:27,847 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=2168046.0, ans=0.125 2023-06-28 23:13:45,940 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.11 vs. limit=15.0 2023-06-28 23:13:49,184 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=11.47 vs. limit=15.0 2023-06-28 23:14:11,830 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2168166.0, ans=0.1 2023-06-28 23:14:44,363 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.82 vs. limit=10.0 2023-06-28 23:14:53,258 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=2168286.0, ans=0.04949747468305833 2023-06-28 23:15:07,746 INFO [train.py:996] (3/4) Epoch 12, batch 25950, loss[loss=0.2593, simple_loss=0.3399, pruned_loss=0.08934, over 21576.00 frames. ], tot_loss[loss=0.2281, simple_loss=0.3102, pruned_loss=0.07302, over 4279124.01 frames. ], batch size: 414, lr: 2.38e-03, grad_scale: 16.0 2023-06-28 23:15:10,141 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2168346.0, ans=0.1 2023-06-28 23:15:36,798 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=9.23 vs. limit=15.0 2023-06-28 23:15:37,503 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=2168406.0, ans=0.2 2023-06-28 23:15:43,674 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.181e+02 7.724e+02 1.093e+03 1.792e+03 4.212e+03, threshold=2.186e+03, percent-clipped=19.0 2023-06-28 23:15:57,845 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.31 vs. limit=12.0 2023-06-28 23:16:09,090 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=2168526.0, ans=0.125 2023-06-28 23:16:54,127 INFO [train.py:996] (3/4) Epoch 12, batch 26000, loss[loss=0.2179, simple_loss=0.3003, pruned_loss=0.06775, over 21439.00 frames. ], tot_loss[loss=0.2269, simple_loss=0.3101, pruned_loss=0.07182, over 4280537.32 frames. ], batch size: 211, lr: 2.38e-03, grad_scale: 32.0 2023-06-28 23:17:11,146 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=2168646.0, ans=0.125 2023-06-28 23:17:54,725 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.66 vs. 
limit=15.0 2023-06-28 23:18:05,657 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=2168826.0, ans=0.0 2023-06-28 23:18:07,575 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=2168826.0, ans=0.125 2023-06-28 23:18:17,489 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=2168886.0, ans=0.125 2023-06-28 23:18:36,001 INFO [train.py:996] (3/4) Epoch 12, batch 26050, loss[loss=0.2034, simple_loss=0.2762, pruned_loss=0.06529, over 21890.00 frames. ], tot_loss[loss=0.2268, simple_loss=0.3099, pruned_loss=0.07186, over 4278976.61 frames. ], batch size: 351, lr: 2.38e-03, grad_scale: 16.0 2023-06-28 23:19:08,203 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.201e+02 7.488e+02 9.436e+02 1.230e+03 3.511e+03, threshold=1.887e+03, percent-clipped=1.0 2023-06-28 23:20:16,747 INFO [train.py:996] (3/4) Epoch 12, batch 26100, loss[loss=0.1995, simple_loss=0.2655, pruned_loss=0.06672, over 20972.00 frames. ], tot_loss[loss=0.2235, simple_loss=0.3039, pruned_loss=0.07152, over 4280375.12 frames. ], batch size: 607, lr: 2.38e-03, grad_scale: 16.0 2023-06-28 23:22:03,444 INFO [train.py:996] (3/4) Epoch 12, batch 26150, loss[loss=0.21, simple_loss=0.2734, pruned_loss=0.07327, over 20063.00 frames. ], tot_loss[loss=0.2225, simple_loss=0.3004, pruned_loss=0.07226, over 4277190.00 frames. ], batch size: 702, lr: 2.38e-03, grad_scale: 16.0 2023-06-28 23:22:07,296 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=2169546.0, ans=0.125 2023-06-28 23:22:17,622 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.88 vs. limit=12.0 2023-06-28 23:22:27,287 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=2169606.0, ans=0.125 2023-06-28 23:22:31,479 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.410e+02 8.718e+02 1.214e+03 1.632e+03 3.208e+03, threshold=2.428e+03, percent-clipped=15.0 2023-06-28 23:22:35,646 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=2169666.0, ans=0.0 2023-06-28 23:23:00,074 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.16 vs. limit=12.0 2023-06-28 23:23:42,554 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=10.17 vs. limit=15.0 2023-06-28 23:23:44,677 INFO [train.py:996] (3/4) Epoch 12, batch 26200, loss[loss=0.2174, simple_loss=0.3223, pruned_loss=0.05631, over 21868.00 frames. ], tot_loss[loss=0.2222, simple_loss=0.3026, pruned_loss=0.07088, over 4275510.11 frames. 
], batch size: 316, lr: 2.38e-03, grad_scale: 16.0 2023-06-28 23:23:51,958 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=2169846.0, ans=0.2 2023-06-28 23:24:31,727 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2169966.0, ans=0.1 2023-06-28 23:24:38,551 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=2169966.0, ans=0.0 2023-06-28 23:24:45,052 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2170026.0, ans=0.1 2023-06-28 23:24:57,171 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.53 vs. limit=22.5 2023-06-28 23:25:25,902 INFO [train.py:996] (3/4) Epoch 12, batch 26250, loss[loss=0.211, simple_loss=0.2862, pruned_loss=0.06789, over 21831.00 frames. ], tot_loss[loss=0.2225, simple_loss=0.3057, pruned_loss=0.06967, over 4273079.47 frames. ], batch size: 247, lr: 2.38e-03, grad_scale: 16.0 2023-06-28 23:25:32,832 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=2170146.0, ans=0.0 2023-06-28 23:25:52,908 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.372e+02 9.468e+02 1.366e+03 2.102e+03 4.403e+03, threshold=2.732e+03, percent-clipped=13.0 2023-06-28 23:26:39,267 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=2170386.0, ans=0.125 2023-06-28 23:26:52,036 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2170386.0, ans=0.1 2023-06-28 23:26:54,213 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.48 vs. limit=12.0 2023-06-28 23:27:01,313 INFO [train.py:996] (3/4) Epoch 12, batch 26300, loss[loss=0.2219, simple_loss=0.2913, pruned_loss=0.07626, over 22017.00 frames. ], tot_loss[loss=0.221, simple_loss=0.3024, pruned_loss=0.06983, over 4275113.07 frames. ], batch size: 416, lr: 2.38e-03, grad_scale: 16.0 2023-06-28 23:27:38,837 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=2170506.0, ans=0.0 2023-06-28 23:27:42,078 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2170566.0, ans=0.1 2023-06-28 23:27:46,835 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=2170566.0, ans=0.125 2023-06-28 23:28:42,454 INFO [train.py:996] (3/4) Epoch 12, batch 26350, loss[loss=0.2646, simple_loss=0.3359, pruned_loss=0.09666, over 21263.00 frames. ], tot_loss[loss=0.2212, simple_loss=0.3008, pruned_loss=0.07081, over 4282666.07 frames. 
], batch size: 143, lr: 2.38e-03, grad_scale: 16.0 2023-06-28 23:29:00,641 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=2170746.0, ans=0.125 2023-06-28 23:29:13,757 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=2170806.0, ans=0.0 2023-06-28 23:29:19,231 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.451e+02 7.926e+02 1.139e+03 2.111e+03 4.700e+03, threshold=2.277e+03, percent-clipped=11.0 2023-06-28 23:30:02,564 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.80 vs. limit=22.5 2023-06-28 23:30:02,837 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.22 vs. limit=6.0 2023-06-28 23:30:05,654 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=2170986.0, ans=0.0 2023-06-28 23:30:20,490 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2170986.0, ans=0.1 2023-06-28 23:30:23,052 INFO [train.py:996] (3/4) Epoch 12, batch 26400, loss[loss=0.1831, simple_loss=0.2467, pruned_loss=0.0598, over 21623.00 frames. ], tot_loss[loss=0.2178, simple_loss=0.2944, pruned_loss=0.07056, over 4275619.02 frames. ], batch size: 231, lr: 2.38e-03, grad_scale: 32.0 2023-06-28 23:30:55,471 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=2171106.0, ans=0.0 2023-06-28 23:31:00,592 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=2171106.0, ans=10.0 2023-06-28 23:31:40,878 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-28 23:32:16,581 INFO [train.py:996] (3/4) Epoch 12, batch 26450, loss[loss=0.2273, simple_loss=0.3322, pruned_loss=0.06122, over 20743.00 frames. ], tot_loss[loss=0.2166, simple_loss=0.2931, pruned_loss=0.07, over 4247641.72 frames. ], batch size: 607, lr: 2.38e-03, grad_scale: 16.0 2023-06-28 23:32:32,870 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=2171406.0, ans=0.2 2023-06-28 23:32:51,626 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.901e+02 9.721e+02 1.441e+03 2.127e+03 5.226e+03, threshold=2.882e+03, percent-clipped=23.0 2023-06-28 23:32:55,630 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=2171466.0, ans=0.125 2023-06-28 23:33:38,907 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.09 vs. limit=22.5 2023-06-28 23:33:42,625 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.78 vs. limit=22.5 2023-06-28 23:33:59,928 INFO [train.py:996] (3/4) Epoch 12, batch 26500, loss[loss=0.1857, simple_loss=0.2551, pruned_loss=0.05813, over 21398.00 frames. ], tot_loss[loss=0.2168, simple_loss=0.2953, pruned_loss=0.06913, over 4241100.86 frames. 
], batch size: 194, lr: 2.38e-03, grad_scale: 16.0 2023-06-28 23:35:13,826 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=2171826.0, ans=0.125 2023-06-28 23:35:47,898 INFO [train.py:996] (3/4) Epoch 12, batch 26550, loss[loss=0.1855, simple_loss=0.2706, pruned_loss=0.0502, over 21609.00 frames. ], tot_loss[loss=0.2134, simple_loss=0.2936, pruned_loss=0.06657, over 4248595.12 frames. ], batch size: 263, lr: 2.38e-03, grad_scale: 16.0 2023-06-28 23:36:00,565 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=2.97 vs. limit=10.0 2023-06-28 23:36:11,902 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=2172006.0, ans=0.125 2023-06-28 23:36:15,453 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=2172006.0, ans=0.0 2023-06-28 23:36:23,133 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.969e+02 7.971e+02 1.184e+03 2.245e+03 4.419e+03, threshold=2.369e+03, percent-clipped=15.0 2023-06-28 23:37:28,574 INFO [train.py:996] (3/4) Epoch 12, batch 26600, loss[loss=0.2469, simple_loss=0.3039, pruned_loss=0.09491, over 21388.00 frames. ], tot_loss[loss=0.2125, simple_loss=0.2948, pruned_loss=0.06509, over 4253142.81 frames. ], batch size: 508, lr: 2.38e-03, grad_scale: 16.0 2023-06-28 23:37:56,465 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.min_positive, batch_count=2172306.0, ans=0.025 2023-06-28 23:38:03,038 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=2172306.0, ans=0.1 2023-06-28 23:38:37,589 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=11.82 vs. limit=15.0 2023-06-28 23:39:02,286 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=2172486.0, ans=0.125 2023-06-28 23:39:08,223 INFO [train.py:996] (3/4) Epoch 12, batch 26650, loss[loss=0.1549, simple_loss=0.2393, pruned_loss=0.03526, over 21749.00 frames. ], tot_loss[loss=0.2078, simple_loss=0.2878, pruned_loss=0.06389, over 4250618.26 frames. 
], batch size: 282, lr: 2.38e-03, grad_scale: 16.0 2023-06-28 23:39:23,087 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2172546.0, ans=0.1 2023-06-28 23:39:43,563 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=2172606.0, ans=0.0 2023-06-28 23:39:46,394 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.698e+02 6.768e+02 8.885e+02 1.234e+03 3.430e+03, threshold=1.777e+03, percent-clipped=1.0 2023-06-28 23:39:58,109 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=2172666.0, ans=0.125 2023-06-28 23:40:05,824 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2172666.0, ans=0.1 2023-06-28 23:40:20,454 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=2172726.0, ans=0.125 2023-06-28 23:40:20,488 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=2172726.0, ans=10.0 2023-06-28 23:40:33,523 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=2172786.0, ans=0.125 2023-06-28 23:40:52,087 INFO [train.py:996] (3/4) Epoch 12, batch 26700, loss[loss=0.2168, simple_loss=0.2935, pruned_loss=0.07004, over 21873.00 frames. ], tot_loss[loss=0.2018, simple_loss=0.2809, pruned_loss=0.06138, over 4251399.70 frames. ], batch size: 107, lr: 2.38e-03, grad_scale: 16.0 2023-06-28 23:41:01,276 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2172846.0, ans=0.1 2023-06-28 23:41:19,248 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=2172906.0, ans=0.07 2023-06-28 23:41:35,387 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=2172966.0, ans=0.0 2023-06-28 23:42:06,259 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=2173026.0, ans=0.125 2023-06-28 23:42:22,403 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2173086.0, ans=0.1 2023-06-28 23:42:33,655 INFO [train.py:996] (3/4) Epoch 12, batch 26750, loss[loss=0.2479, simple_loss=0.3344, pruned_loss=0.08068, over 21811.00 frames. ], tot_loss[loss=0.2013, simple_loss=0.2816, pruned_loss=0.06053, over 4256196.41 frames. ], batch size: 124, lr: 2.38e-03, grad_scale: 16.0 2023-06-28 23:43:06,961 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=2173206.0, ans=0.125 2023-06-28 23:43:12,966 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.416e+02 8.010e+02 1.094e+03 1.630e+03 3.819e+03, threshold=2.187e+03, percent-clipped=19.0 2023-06-28 23:43:16,082 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=8.59 vs. 
limit=15.0 2023-06-28 23:43:43,497 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=2173326.0, ans=0.0 2023-06-28 23:43:52,982 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=2173326.0, ans=0.0 2023-06-28 23:44:06,465 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=2173386.0, ans=0.125 2023-06-28 23:44:16,177 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2173386.0, ans=0.1 2023-06-28 23:44:20,462 INFO [train.py:996] (3/4) Epoch 12, batch 26800, loss[loss=0.2076, simple_loss=0.2859, pruned_loss=0.0646, over 21763.00 frames. ], tot_loss[loss=0.2086, simple_loss=0.2887, pruned_loss=0.06425, over 4269084.99 frames. ], batch size: 332, lr: 2.38e-03, grad_scale: 32.0 2023-06-28 23:44:37,802 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.41 vs. limit=15.0 2023-06-28 23:44:59,568 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=2173506.0, ans=0.125 2023-06-28 23:45:02,294 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.61 vs. limit=15.0 2023-06-28 23:45:03,145 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=2173566.0, ans=0.125 2023-06-28 23:45:08,004 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=2173566.0, ans=0.125 2023-06-28 23:46:05,357 INFO [train.py:996] (3/4) Epoch 12, batch 26850, loss[loss=0.1948, simple_loss=0.2601, pruned_loss=0.06471, over 21587.00 frames. ], tot_loss[loss=0.2115, simple_loss=0.2894, pruned_loss=0.06674, over 4267616.43 frames. ], batch size: 263, lr: 2.38e-03, grad_scale: 16.0 2023-06-28 23:46:26,753 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=2173806.0, ans=0.0 2023-06-28 23:46:30,120 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=2173806.0, ans=0.125 2023-06-28 23:46:39,689 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=2173806.0, ans=0.125 2023-06-28 23:46:40,817 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.776e+02 8.023e+02 1.160e+03 1.579e+03 4.505e+03, threshold=2.321e+03, percent-clipped=8.0 2023-06-28 23:47:08,928 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.39 vs. limit=15.0 2023-06-28 23:47:39,175 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=2174046.0, ans=0.125 2023-06-28 23:47:40,060 INFO [train.py:996] (3/4) Epoch 12, batch 26900, loss[loss=0.2023, simple_loss=0.2747, pruned_loss=0.06489, over 19998.00 frames. ], tot_loss[loss=0.2065, simple_loss=0.2812, pruned_loss=0.06586, over 4267629.87 frames. 
], batch size: 702, lr: 2.38e-03, grad_scale: 16.0 2023-06-28 23:48:04,680 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=2.91 vs. limit=12.0 2023-06-28 23:48:06,049 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=2174106.0, ans=0.125 2023-06-28 23:48:40,677 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.08 vs. limit=15.0 2023-06-28 23:49:00,451 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=2174286.0, ans=0.0 2023-06-28 23:49:12,849 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2174286.0, ans=0.1 2023-06-28 23:49:19,000 INFO [train.py:996] (3/4) Epoch 12, batch 26950, loss[loss=0.22, simple_loss=0.3097, pruned_loss=0.06515, over 21324.00 frames. ], tot_loss[loss=0.2069, simple_loss=0.282, pruned_loss=0.06593, over 4268367.96 frames. ], batch size: 176, lr: 2.38e-03, grad_scale: 16.0 2023-06-28 23:49:22,549 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=2174346.0, ans=0.125 2023-06-28 23:49:52,158 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=2174406.0, ans=0.125 2023-06-28 23:49:54,864 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.876e+02 6.986e+02 1.003e+03 1.529e+03 4.492e+03, threshold=2.006e+03, percent-clipped=11.0 2023-06-28 23:49:55,688 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=2174466.0, ans=0.125 2023-06-28 23:50:53,788 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2174586.0, ans=0.1 2023-06-28 23:51:06,156 INFO [train.py:996] (3/4) Epoch 12, batch 27000, loss[loss=0.2016, simple_loss=0.3115, pruned_loss=0.04587, over 20845.00 frames. ], tot_loss[loss=0.204, simple_loss=0.2817, pruned_loss=0.06318, over 4252940.18 frames. ], batch size: 608, lr: 2.38e-03, grad_scale: 16.0 2023-06-28 23:51:06,157 INFO [train.py:1019] (3/4) Computing validation loss 2023-06-28 23:51:22,022 INFO [train.py:1028] (3/4) Epoch 12, validation: loss=0.2512, simple_loss=0.3387, pruned_loss=0.08188, over 1796401.00 frames. 2023-06-28 23:51:22,023 INFO [train.py:1029] (3/4) Maximum memory allocated so far is 23690MB 2023-06-28 23:53:01,299 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=2174886.0, ans=0.2 2023-06-28 23:53:03,873 INFO [train.py:996] (3/4) Epoch 12, batch 27050, loss[loss=0.1772, simple_loss=0.2742, pruned_loss=0.04013, over 21643.00 frames. ], tot_loss[loss=0.203, simple_loss=0.2844, pruned_loss=0.06078, over 4252978.05 frames. ], batch size: 247, lr: 2.38e-03, grad_scale: 16.0 2023-06-28 23:53:44,976 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.952e+02 1.010e+03 1.463e+03 2.409e+03 4.686e+03, threshold=2.925e+03, percent-clipped=39.0 2023-06-28 23:54:45,814 INFO [train.py:996] (3/4) Epoch 12, batch 27100, loss[loss=0.2449, simple_loss=0.3301, pruned_loss=0.07983, over 21607.00 frames. ], tot_loss[loss=0.2046, simple_loss=0.2858, pruned_loss=0.06166, over 4267616.82 frames. 
], batch size: 471, lr: 2.38e-03, grad_scale: 16.0 2023-06-28 23:54:49,526 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=2175246.0, ans=0.0 2023-06-28 23:55:36,436 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=13.66 vs. limit=22.5 2023-06-28 23:55:47,823 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=2175366.0, ans=0.125 2023-06-28 23:55:57,154 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=8.54 vs. limit=15.0 2023-06-28 23:56:34,064 INFO [train.py:996] (3/4) Epoch 12, batch 27150, loss[loss=0.2474, simple_loss=0.3414, pruned_loss=0.07668, over 21829.00 frames. ], tot_loss[loss=0.2146, simple_loss=0.2987, pruned_loss=0.06528, over 4272954.77 frames. ], batch size: 316, lr: 2.38e-03, grad_scale: 16.0 2023-06-28 23:56:48,656 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.53 vs. limit=12.0 2023-06-28 23:57:01,805 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=9.85 vs. limit=15.0 2023-06-28 23:57:02,507 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=2175606.0, ans=0.0 2023-06-28 23:57:10,151 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=2175606.0, ans=0.125 2023-06-28 23:57:10,880 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.06 vs. limit=15.0 2023-06-28 23:57:14,485 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 6.191e+02 8.496e+02 1.171e+03 1.771e+03 3.313e+03, threshold=2.341e+03, percent-clipped=5.0 2023-06-28 23:57:19,108 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten.whitening_limit, batch_count=2175666.0, ans=15.0 2023-06-28 23:57:55,045 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.15 vs. limit=15.0 2023-06-28 23:58:09,504 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=2175786.0, ans=0.125 2023-06-28 23:58:15,369 INFO [train.py:996] (3/4) Epoch 12, batch 27200, loss[loss=0.1998, simple_loss=0.2784, pruned_loss=0.06066, over 20053.00 frames. ], tot_loss[loss=0.2206, simple_loss=0.3061, pruned_loss=0.06755, over 4269165.60 frames. 
], batch size: 703, lr: 2.38e-03, grad_scale: 32.0 2023-06-28 23:58:15,926 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=2175846.0, ans=0.0 2023-06-28 23:58:53,581 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=2175906.0, ans=0.125 2023-06-28 23:58:53,600 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2175906.0, ans=0.1 2023-06-28 23:59:21,268 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=2176026.0, ans=0.125 2023-06-28 23:59:46,201 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=2176086.0, ans=0.0 2023-06-29 00:00:01,696 INFO [train.py:996] (3/4) Epoch 12, batch 27250, loss[loss=0.2329, simple_loss=0.3063, pruned_loss=0.07981, over 21403.00 frames. ], tot_loss[loss=0.2239, simple_loss=0.3069, pruned_loss=0.07048, over 4263753.88 frames. ], batch size: 549, lr: 2.38e-03, grad_scale: 16.0 2023-06-29 00:00:45,296 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.531e+02 9.436e+02 1.424e+03 2.260e+03 4.305e+03, threshold=2.849e+03, percent-clipped=22.0 2023-06-29 00:01:22,327 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=2176326.0, ans=0.125 2023-06-29 00:01:43,744 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=2176386.0, ans=0.2 2023-06-29 00:01:46,084 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=8.73 vs. limit=15.0 2023-06-29 00:01:49,810 INFO [train.py:996] (3/4) Epoch 12, batch 27300, loss[loss=0.2101, simple_loss=0.2728, pruned_loss=0.0737, over 20034.00 frames. ], tot_loss[loss=0.2259, simple_loss=0.3083, pruned_loss=0.07175, over 4265487.41 frames. ], batch size: 703, lr: 2.38e-03, grad_scale: 16.0 2023-06-29 00:02:15,559 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=2176506.0, ans=0.0 2023-06-29 00:03:07,468 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=2176626.0, ans=0.2 2023-06-29 00:03:07,562 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=2176626.0, ans=0.0 2023-06-29 00:03:30,706 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=2176746.0, ans=0.125 2023-06-29 00:03:31,713 INFO [train.py:996] (3/4) Epoch 12, batch 27350, loss[loss=0.197, simple_loss=0.2879, pruned_loss=0.05307, over 21777.00 frames. ], tot_loss[loss=0.2283, simple_loss=0.3115, pruned_loss=0.07257, over 4269982.37 frames. 
], batch size: 332, lr: 2.38e-03, grad_scale: 16.0 2023-06-29 00:03:33,627 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=2176746.0, ans=0.1 2023-06-29 00:03:33,823 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=2176746.0, ans=0.0 2023-06-29 00:03:53,117 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=2176806.0, ans=0.0 2023-06-29 00:04:01,778 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.93 vs. limit=15.0 2023-06-29 00:04:04,440 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=2176806.0, ans=0.0 2023-06-29 00:04:13,494 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.353e+02 7.469e+02 1.032e+03 1.512e+03 4.171e+03, threshold=2.065e+03, percent-clipped=4.0 2023-06-29 00:04:51,767 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=2176926.0, ans=0.125 2023-06-29 00:05:15,160 INFO [train.py:996] (3/4) Epoch 12, batch 27400, loss[loss=0.2025, simple_loss=0.2609, pruned_loss=0.07207, over 21631.00 frames. ], tot_loss[loss=0.2259, simple_loss=0.3071, pruned_loss=0.07233, over 4277286.12 frames. ], batch size: 231, lr: 2.38e-03, grad_scale: 16.0 2023-06-29 00:05:17,547 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=2177046.0, ans=0.125 2023-06-29 00:05:19,034 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=2177046.0, ans=0.05 2023-06-29 00:06:06,452 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=2177166.0, ans=0.0 2023-06-29 00:06:20,507 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=2177226.0, ans=0.125 2023-06-29 00:06:55,562 INFO [train.py:996] (3/4) Epoch 12, batch 27450, loss[loss=0.2207, simple_loss=0.3057, pruned_loss=0.06789, over 21438.00 frames. ], tot_loss[loss=0.2222, simple_loss=0.3024, pruned_loss=0.07096, over 4279809.46 frames. ], batch size: 194, lr: 2.38e-03, grad_scale: 16.0 2023-06-29 00:07:12,577 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=2177406.0, ans=0.0 2023-06-29 00:07:20,565 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=2177406.0, ans=0.125 2023-06-29 00:07:32,478 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.358e+02 7.847e+02 1.147e+03 1.584e+03 3.380e+03, threshold=2.294e+03, percent-clipped=11.0 2023-06-29 00:07:59,922 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=2177526.0, ans=0.125 2023-06-29 00:08:20,928 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-29 00:08:34,576 INFO [train.py:996] (3/4) Epoch 12, batch 27500, loss[loss=0.2094, simple_loss=0.284, pruned_loss=0.0674, over 21509.00 frames. ], tot_loss[loss=0.2208, simple_loss=0.2997, pruned_loss=0.07094, over 4282806.88 frames. 
], batch size: 194, lr: 2.38e-03, grad_scale: 16.0 2023-06-29 00:08:41,955 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=18.69 vs. limit=22.5 2023-06-29 00:09:56,170 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=2177886.0, ans=0.125 2023-06-29 00:10:06,351 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.52 vs. limit=15.0 2023-06-29 00:10:15,300 INFO [train.py:996] (3/4) Epoch 12, batch 27550, loss[loss=0.2261, simple_loss=0.2907, pruned_loss=0.08082, over 21354.00 frames. ], tot_loss[loss=0.2154, simple_loss=0.294, pruned_loss=0.06836, over 4282753.09 frames. ], batch size: 471, lr: 2.38e-03, grad_scale: 16.0 2023-06-29 00:10:31,611 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2177946.0, ans=0.1 2023-06-29 00:10:32,134 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.52 vs. limit=22.5 2023-06-29 00:10:49,449 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=2178006.0, ans=0.5 2023-06-29 00:10:57,004 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.117e+02 1.004e+03 1.516e+03 2.430e+03 4.785e+03, threshold=3.032e+03, percent-clipped=27.0 2023-06-29 00:11:18,133 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=2178126.0, ans=0.125 2023-06-29 00:11:44,222 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=2178186.0, ans=0.125 2023-06-29 00:11:54,698 INFO [train.py:996] (3/4) Epoch 12, batch 27600, loss[loss=0.2008, simple_loss=0.2655, pruned_loss=0.06808, over 21796.00 frames. ], tot_loss[loss=0.211, simple_loss=0.2872, pruned_loss=0.06742, over 4276684.21 frames. ], batch size: 112, lr: 2.38e-03, grad_scale: 16.0 2023-06-29 00:12:56,366 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=2178426.0, ans=0.125 2023-06-29 00:13:34,122 INFO [train.py:996] (3/4) Epoch 12, batch 27650, loss[loss=0.1939, simple_loss=0.2608, pruned_loss=0.06353, over 21288.00 frames. ], tot_loss[loss=0.2083, simple_loss=0.2822, pruned_loss=0.06718, over 4278124.74 frames. ], batch size: 176, lr: 2.38e-03, grad_scale: 16.0 2023-06-29 00:13:36,420 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=2178546.0, ans=0.0 2023-06-29 00:13:38,598 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=12.97 vs. 
limit=22.5 2023-06-29 00:14:07,502 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=2178606.0, ans=0.2 2023-06-29 00:14:17,723 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.954e+02 7.725e+02 1.101e+03 1.627e+03 3.974e+03, threshold=2.201e+03, percent-clipped=3.0 2023-06-29 00:14:19,998 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=2178666.0, ans=0.2 2023-06-29 00:14:21,799 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=2178666.0, ans=0.125 2023-06-29 00:14:31,616 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=2178666.0, ans=0.125 2023-06-29 00:14:58,237 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2178786.0, ans=0.1 2023-06-29 00:15:15,561 INFO [train.py:996] (3/4) Epoch 12, batch 27700, loss[loss=0.1867, simple_loss=0.2639, pruned_loss=0.0548, over 21847.00 frames. ], tot_loss[loss=0.2058, simple_loss=0.2819, pruned_loss=0.06489, over 4276916.68 frames. ], batch size: 124, lr: 2.38e-03, grad_scale: 16.0 2023-06-29 00:15:32,676 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=2178846.0, ans=0.125 2023-06-29 00:15:44,537 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn2.whiten.whitening_limit, batch_count=2178906.0, ans=22.5 2023-06-29 00:15:57,006 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=2178966.0, ans=0.2 2023-06-29 00:16:01,765 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=2178966.0, ans=0.125 2023-06-29 00:16:16,188 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2178966.0, ans=0.1 2023-06-29 00:16:19,353 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=2179026.0, ans=0.0 2023-06-29 00:16:46,807 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=2179086.0, ans=0.125 2023-06-29 00:16:56,208 INFO [train.py:996] (3/4) Epoch 12, batch 27750, loss[loss=0.1922, simple_loss=0.2894, pruned_loss=0.0475, over 21292.00 frames. ], tot_loss[loss=0.2078, simple_loss=0.2861, pruned_loss=0.06473, over 4275502.90 frames. 
], batch size: 548, lr: 2.38e-03, grad_scale: 16.0 2023-06-29 00:17:35,709 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=2179206.0, ans=0.125 2023-06-29 00:17:38,807 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=2179266.0, ans=0.0 2023-06-29 00:17:39,849 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 6.037e+02 8.775e+02 1.414e+03 2.124e+03 3.615e+03, threshold=2.828e+03, percent-clipped=21.0 2023-06-29 00:18:07,573 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=2179326.0, ans=0.0 2023-06-29 00:18:09,021 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=2179326.0, ans=0.125 2023-06-29 00:18:12,025 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=2179326.0, ans=0.125 2023-06-29 00:18:29,557 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=2179386.0, ans=0.125 2023-06-29 00:18:35,463 INFO [train.py:996] (3/4) Epoch 12, batch 27800, loss[loss=0.236, simple_loss=0.3002, pruned_loss=0.08595, over 21642.00 frames. ], tot_loss[loss=0.2077, simple_loss=0.2852, pruned_loss=0.06508, over 4277450.43 frames. ], batch size: 471, lr: 2.38e-03, grad_scale: 16.0 2023-06-29 00:20:03,447 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=2179686.0, ans=0.125 2023-06-29 00:20:16,292 INFO [train.py:996] (3/4) Epoch 12, batch 27850, loss[loss=0.2397, simple_loss=0.3124, pruned_loss=0.08351, over 21591.00 frames. ], tot_loss[loss=0.208, simple_loss=0.2844, pruned_loss=0.06584, over 4285905.82 frames. ], batch size: 471, lr: 2.38e-03, grad_scale: 16.0 2023-06-29 00:20:38,612 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.98 vs. limit=15.0 2023-06-29 00:20:47,480 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=2179806.0, ans=0.125 2023-06-29 00:20:51,219 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=2179806.0, ans=0.125 2023-06-29 00:20:51,361 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=2179806.0, ans=0.0 2023-06-29 00:21:00,408 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.968e+02 8.980e+02 1.586e+03 2.122e+03 3.865e+03, threshold=3.171e+03, percent-clipped=6.0 2023-06-29 00:21:09,821 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.24 vs. limit=12.0 2023-06-29 00:21:14,948 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.07 vs. limit=15.0 2023-06-29 00:22:03,470 INFO [train.py:996] (3/4) Epoch 12, batch 27900, loss[loss=0.2429, simple_loss=0.3378, pruned_loss=0.07401, over 21651.00 frames. ], tot_loss[loss=0.2144, simple_loss=0.2949, pruned_loss=0.06691, over 4291193.70 frames. 
], batch size: 389, lr: 2.38e-03, grad_scale: 16.0 2023-06-29 00:22:04,023 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=2180046.0, ans=0.0 2023-06-29 00:22:24,261 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=2180106.0, ans=0.125 2023-06-29 00:22:35,523 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=2180106.0, ans=0.125 2023-06-29 00:22:45,522 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=2180166.0, ans=0.0 2023-06-29 00:23:27,571 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=2180286.0, ans=0.0 2023-06-29 00:23:51,642 INFO [train.py:996] (3/4) Epoch 12, batch 27950, loss[loss=0.1845, simple_loss=0.2823, pruned_loss=0.04333, over 21838.00 frames. ], tot_loss[loss=0.2107, simple_loss=0.2943, pruned_loss=0.06354, over 4285703.47 frames. ], batch size: 282, lr: 2.38e-03, grad_scale: 16.0 2023-06-29 00:24:26,756 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=2180406.0, ans=0.0 2023-06-29 00:24:35,524 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.187e+02 9.154e+02 1.408e+03 1.897e+03 4.005e+03, threshold=2.816e+03, percent-clipped=4.0 2023-06-29 00:24:43,367 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.07 vs. limit=6.0 2023-06-29 00:25:23,332 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=2180586.0, ans=0.125 2023-06-29 00:25:31,890 INFO [train.py:996] (3/4) Epoch 12, batch 28000, loss[loss=0.201, simple_loss=0.304, pruned_loss=0.04899, over 21286.00 frames. ], tot_loss[loss=0.2078, simple_loss=0.2922, pruned_loss=0.06175, over 4288965.18 frames. ], batch size: 549, lr: 2.38e-03, grad_scale: 32.0 2023-06-29 00:25:32,443 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=2180646.0, ans=0.125 2023-06-29 00:27:03,912 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=2180886.0, ans=0.0 2023-06-29 00:27:15,057 INFO [train.py:996] (3/4) Epoch 12, batch 28050, loss[loss=0.174, simple_loss=0.2385, pruned_loss=0.05475, over 21442.00 frames. ], tot_loss[loss=0.2073, simple_loss=0.289, pruned_loss=0.06276, over 4285138.59 frames. ], batch size: 211, lr: 2.38e-03, grad_scale: 16.0 2023-06-29 00:28:00,399 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.693e+02 7.704e+02 1.092e+03 1.721e+03 4.655e+03, threshold=2.185e+03, percent-clipped=4.0 2023-06-29 00:28:01,067 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=2181066.0, ans=0.125 2023-06-29 00:28:21,775 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=2181126.0, ans=0.05 2023-06-29 00:28:59,168 INFO [train.py:996] (3/4) Epoch 12, batch 28100, loss[loss=0.2018, simple_loss=0.2669, pruned_loss=0.0683, over 21579.00 frames. 
], tot_loss[loss=0.2061, simple_loss=0.2865, pruned_loss=0.06282, over 4284260.83 frames. ], batch size: 414, lr: 2.38e-03, grad_scale: 16.0 2023-06-29 00:29:11,357 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=2181246.0, ans=0.125 2023-06-29 00:29:28,733 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=2181306.0, ans=0.125 2023-06-29 00:30:05,005 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=2181426.0, ans=0.125 2023-06-29 00:30:39,423 INFO [train.py:996] (3/4) Epoch 12, batch 28150, loss[loss=0.1763, simple_loss=0.2423, pruned_loss=0.05511, over 21602.00 frames. ], tot_loss[loss=0.2018, simple_loss=0.2788, pruned_loss=0.0624, over 4277290.35 frames. ], batch size: 231, lr: 2.38e-03, grad_scale: 16.0 2023-06-29 00:30:48,680 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=2181546.0, ans=0.0 2023-06-29 00:31:07,457 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.57 vs. limit=15.0 2023-06-29 00:31:20,816 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.946e+02 8.258e+02 1.413e+03 2.441e+03 4.810e+03, threshold=2.825e+03, percent-clipped=31.0 2023-06-29 00:31:45,171 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=2181726.0, ans=0.0 2023-06-29 00:32:08,719 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.75 vs. limit=22.5 2023-06-29 00:32:20,020 INFO [train.py:996] (3/4) Epoch 12, batch 28200, loss[loss=0.228, simple_loss=0.3032, pruned_loss=0.07646, over 21922.00 frames. ], tot_loss[loss=0.2028, simple_loss=0.2781, pruned_loss=0.06371, over 4266099.87 frames. ], batch size: 372, lr: 2.38e-03, grad_scale: 16.0 2023-06-29 00:32:34,115 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=10.44 vs. limit=15.0 2023-06-29 00:32:40,062 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=2181906.0, ans=0.0 2023-06-29 00:32:53,361 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=2181906.0, ans=0.125 2023-06-29 00:34:06,090 INFO [train.py:996] (3/4) Epoch 12, batch 28250, loss[loss=0.2092, simple_loss=0.2828, pruned_loss=0.06778, over 20665.00 frames. ], tot_loss[loss=0.2072, simple_loss=0.2817, pruned_loss=0.0663, over 4260670.11 frames. ], batch size: 607, lr: 2.38e-03, grad_scale: 16.0 2023-06-29 00:34:22,614 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.14 vs. 
limit=22.5 2023-06-29 00:34:23,545 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=2182206.0, ans=0.07 2023-06-29 00:34:26,790 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=2182206.0, ans=0.0 2023-06-29 00:34:30,189 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=2182206.0, ans=0.05 2023-06-29 00:34:31,974 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=2182206.0, ans=0.0 2023-06-29 00:34:35,499 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=2182206.0, ans=0.2 2023-06-29 00:34:40,868 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.87 vs. limit=6.0 2023-06-29 00:34:48,082 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.469e+02 1.185e+03 1.669e+03 2.503e+03 4.651e+03, threshold=3.338e+03, percent-clipped=13.0 2023-06-29 00:34:51,193 INFO [scaling.py:962] (3/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=3.93 vs. limit=5.0 2023-06-29 00:34:55,815 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=4.60 vs. limit=15.0 2023-06-29 00:35:27,283 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2182386.0, ans=0.1 2023-06-29 00:35:48,350 INFO [train.py:996] (3/4) Epoch 12, batch 28300, loss[loss=0.1634, simple_loss=0.2228, pruned_loss=0.05201, over 20727.00 frames. ], tot_loss[loss=0.2045, simple_loss=0.2797, pruned_loss=0.0647, over 4255260.06 frames. ], batch size: 608, lr: 2.37e-03, grad_scale: 16.0 2023-06-29 00:35:59,031 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=2182446.0, ans=0.125 2023-06-29 00:36:07,848 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.89 vs. limit=22.5 2023-06-29 00:36:21,217 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=4.53 vs. limit=15.0 2023-06-29 00:36:41,776 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=2182566.0, ans=0.125 2023-06-29 00:37:05,150 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.09 vs. limit=12.0 2023-06-29 00:37:29,514 INFO [train.py:996] (3/4) Epoch 12, batch 28350, loss[loss=0.1811, simple_loss=0.2759, pruned_loss=0.04311, over 21184.00 frames. ], tot_loss[loss=0.1991, simple_loss=0.2783, pruned_loss=0.06, over 4260589.17 frames. 
], batch size: 548, lr: 2.37e-03, grad_scale: 16.0 2023-06-29 00:37:33,406 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=2182746.0, ans=0.0 2023-06-29 00:37:57,958 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=2182806.0, ans=0.125 2023-06-29 00:38:15,051 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.438e+02 7.077e+02 1.027e+03 1.914e+03 4.296e+03, threshold=2.054e+03, percent-clipped=2.0 2023-06-29 00:38:20,452 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=2182866.0, ans=0.125 2023-06-29 00:39:10,267 INFO [train.py:996] (3/4) Epoch 12, batch 28400, loss[loss=0.2022, simple_loss=0.276, pruned_loss=0.06416, over 21672.00 frames. ], tot_loss[loss=0.1981, simple_loss=0.2752, pruned_loss=0.06052, over 4258329.34 frames. ], batch size: 298, lr: 2.37e-03, grad_scale: 32.0 2023-06-29 00:39:14,484 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=2183046.0, ans=0.2 2023-06-29 00:39:37,782 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=2183106.0, ans=0.0 2023-06-29 00:39:45,983 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=2183106.0, ans=0.2 2023-06-29 00:39:58,990 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=2183166.0, ans=0.125 2023-06-29 00:40:52,141 INFO [train.py:996] (3/4) Epoch 12, batch 28450, loss[loss=0.2758, simple_loss=0.3258, pruned_loss=0.1129, over 21660.00 frames. ], tot_loss[loss=0.205, simple_loss=0.2813, pruned_loss=0.06432, over 4261936.22 frames. ], batch size: 507, lr: 2.37e-03, grad_scale: 16.0 2023-06-29 00:41:43,663 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.501e+02 7.826e+02 1.097e+03 1.608e+03 4.884e+03, threshold=2.195e+03, percent-clipped=11.0 2023-06-29 00:41:45,637 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=2183466.0, ans=0.0 2023-06-29 00:42:05,639 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=2183526.0, ans=0.125 2023-06-29 00:42:15,506 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2183586.0, ans=0.1 2023-06-29 00:42:38,364 INFO [train.py:996] (3/4) Epoch 12, batch 28500, loss[loss=0.2182, simple_loss=0.2982, pruned_loss=0.06907, over 21789.00 frames. ], tot_loss[loss=0.2083, simple_loss=0.284, pruned_loss=0.06631, over 4272857.16 frames. ], batch size: 351, lr: 2.37e-03, grad_scale: 16.0 2023-06-29 00:42:41,212 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=11.87 vs. limit=15.0 2023-06-29 00:42:47,946 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=12.29 vs. limit=15.0 2023-06-29 00:42:51,615 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=7.27 vs. 
limit=15.0 2023-06-29 00:43:35,199 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=2183766.0, ans=0.125 2023-06-29 00:43:50,455 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=2183826.0, ans=0.125 2023-06-29 00:44:18,892 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=2183886.0, ans=0.0 2023-06-29 00:44:21,534 INFO [train.py:996] (3/4) Epoch 12, batch 28550, loss[loss=0.2718, simple_loss=0.3683, pruned_loss=0.08762, over 21635.00 frames. ], tot_loss[loss=0.2141, simple_loss=0.2913, pruned_loss=0.06847, over 4267604.99 frames. ], batch size: 414, lr: 2.37e-03, grad_scale: 16.0 2023-06-29 00:44:27,804 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.00 vs. limit=22.5 2023-06-29 00:44:48,620 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=2184006.0, ans=0.2 2023-06-29 00:44:57,096 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=2184006.0, ans=0.0 2023-06-29 00:45:06,504 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=2184066.0, ans=0.125 2023-06-29 00:45:09,729 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=2184066.0, ans=0.125 2023-06-29 00:45:12,486 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.914e+02 9.584e+02 1.430e+03 2.076e+03 4.050e+03, threshold=2.859e+03, percent-clipped=23.0 2023-06-29 00:46:00,125 INFO [train.py:996] (3/4) Epoch 12, batch 28600, loss[loss=0.2135, simple_loss=0.2901, pruned_loss=0.06847, over 21793.00 frames. ], tot_loss[loss=0.2192, simple_loss=0.2976, pruned_loss=0.0704, over 4266806.60 frames. ], batch size: 352, lr: 2.37e-03, grad_scale: 8.0 2023-06-29 00:46:09,050 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=2184246.0, ans=0.125 2023-06-29 00:46:09,715 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.17 vs. limit=10.0 2023-06-29 00:46:15,772 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-29 00:46:27,145 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=2184306.0, ans=0.0 2023-06-29 00:46:34,705 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=2184306.0, ans=0.125 2023-06-29 00:46:46,297 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-29 00:47:03,242 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.54 vs. 
limit=15.0 2023-06-29 00:47:04,520 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=2184426.0, ans=0.125 2023-06-29 00:47:33,546 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2184486.0, ans=0.1 2023-06-29 00:47:39,837 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=2184486.0, ans=0.04949747468305833 2023-06-29 00:47:45,763 INFO [train.py:996] (3/4) Epoch 12, batch 28650, loss[loss=0.1922, simple_loss=0.254, pruned_loss=0.06519, over 21583.00 frames. ], tot_loss[loss=0.2164, simple_loss=0.2928, pruned_loss=0.07004, over 4266273.92 frames. ], batch size: 415, lr: 2.37e-03, grad_scale: 8.0 2023-06-29 00:48:24,176 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=2184666.0, ans=0.0 2023-06-29 00:48:30,053 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.107e+02 8.338e+02 1.219e+03 1.644e+03 3.488e+03, threshold=2.437e+03, percent-clipped=4.0 2023-06-29 00:49:26,554 INFO [train.py:996] (3/4) Epoch 12, batch 28700, loss[loss=0.2267, simple_loss=0.2976, pruned_loss=0.07791, over 21752.00 frames. ], tot_loss[loss=0.2166, simple_loss=0.2914, pruned_loss=0.07092, over 4263117.13 frames. ], batch size: 441, lr: 2.37e-03, grad_scale: 8.0 2023-06-29 00:49:34,979 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=2184846.0, ans=0.125 2023-06-29 00:50:06,088 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.05 vs. limit=10.0 2023-06-29 00:50:45,542 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=2185086.0, ans=0.125 2023-06-29 00:50:55,027 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=2185086.0, ans=0.04949747468305833 2023-06-29 00:51:06,115 INFO [train.py:996] (3/4) Epoch 12, batch 28750, loss[loss=0.1987, simple_loss=0.2684, pruned_loss=0.06447, over 21361.00 frames. ], tot_loss[loss=0.2179, simple_loss=0.2927, pruned_loss=0.07153, over 4273083.50 frames. ], batch size: 176, lr: 2.37e-03, grad_scale: 8.0 2023-06-29 00:51:42,767 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=2185266.0, ans=10.0 2023-06-29 00:51:50,491 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.316e+02 7.873e+02 1.096e+03 1.643e+03 3.604e+03, threshold=2.192e+03, percent-clipped=9.0 2023-06-29 00:52:00,976 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=2185266.0, ans=0.125 2023-06-29 00:52:02,393 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=2185266.0, ans=0.125 2023-06-29 00:52:47,797 INFO [train.py:996] (3/4) Epoch 12, batch 28800, loss[loss=0.2563, simple_loss=0.3279, pruned_loss=0.09231, over 21800.00 frames. ], tot_loss[loss=0.218, simple_loss=0.2946, pruned_loss=0.07074, over 4266980.00 frames. 
], batch size: 441, lr: 2.37e-03, grad_scale: 16.0 2023-06-29 00:53:14,459 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=2185506.0, ans=0.0 2023-06-29 00:53:53,076 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=2185626.0, ans=0.125 2023-06-29 00:54:01,219 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=2185626.0, ans=0.125 2023-06-29 00:54:25,720 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=2185686.0, ans=0.0 2023-06-29 00:54:28,452 INFO [train.py:996] (3/4) Epoch 12, batch 28850, loss[loss=0.2216, simple_loss=0.2968, pruned_loss=0.07319, over 21672.00 frames. ], tot_loss[loss=0.2205, simple_loss=0.2958, pruned_loss=0.07256, over 4276293.22 frames. ], batch size: 389, lr: 2.37e-03, grad_scale: 16.0 2023-06-29 00:54:49,305 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2185806.0, ans=0.1 2023-06-29 00:54:55,674 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=2185806.0, ans=0.0 2023-06-29 00:54:55,797 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=2185806.0, ans=0.0 2023-06-29 00:55:01,042 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2185866.0, ans=0.1 2023-06-29 00:55:18,117 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.464e+02 8.066e+02 1.240e+03 1.993e+03 4.428e+03, threshold=2.479e+03, percent-clipped=20.0 2023-06-29 00:56:11,322 INFO [train.py:996] (3/4) Epoch 12, batch 28900, loss[loss=0.2436, simple_loss=0.3286, pruned_loss=0.07932, over 21790.00 frames. ], tot_loss[loss=0.2232, simple_loss=0.2989, pruned_loss=0.07373, over 4283830.38 frames. ], batch size: 118, lr: 2.37e-03, grad_scale: 16.0 2023-06-29 00:56:18,672 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=2186046.0, ans=0.95 2023-06-29 00:57:32,418 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=2186226.0, ans=0.0 2023-06-29 00:57:32,483 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=2186226.0, ans=0.125 2023-06-29 00:57:53,907 INFO [train.py:996] (3/4) Epoch 12, batch 28950, loss[loss=0.2355, simple_loss=0.3255, pruned_loss=0.07277, over 21252.00 frames. ], tot_loss[loss=0.2223, simple_loss=0.2991, pruned_loss=0.07273, over 4277345.88 frames. 
], batch size: 548, lr: 2.37e-03, grad_scale: 16.0 2023-06-29 00:58:46,732 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=2186466.0, ans=0.125 2023-06-29 00:58:47,798 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.559e+02 9.001e+02 1.307e+03 1.896e+03 3.907e+03, threshold=2.614e+03, percent-clipped=14.0 2023-06-29 00:59:21,900 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=2186586.0, ans=0.0 2023-06-29 00:59:29,052 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.49 vs. limit=22.5 2023-06-29 00:59:40,845 INFO [train.py:996] (3/4) Epoch 12, batch 29000, loss[loss=0.2619, simple_loss=0.332, pruned_loss=0.0959, over 21784.00 frames. ], tot_loss[loss=0.2226, simple_loss=0.3017, pruned_loss=0.07173, over 4273843.22 frames. ], batch size: 441, lr: 2.37e-03, grad_scale: 16.0 2023-06-29 01:01:01,683 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.40 vs. limit=15.0 2023-06-29 01:01:21,629 INFO [train.py:996] (3/4) Epoch 12, batch 29050, loss[loss=0.1891, simple_loss=0.2641, pruned_loss=0.05703, over 21870.00 frames. ], tot_loss[loss=0.2221, simple_loss=0.2997, pruned_loss=0.07227, over 4281403.68 frames. ], batch size: 298, lr: 2.37e-03, grad_scale: 16.0 2023-06-29 01:01:38,777 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.42 vs. limit=15.0 2023-06-29 01:01:57,060 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2187006.0, ans=0.1 2023-06-29 01:02:14,141 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.866e+02 7.703e+02 1.025e+03 1.554e+03 4.084e+03, threshold=2.051e+03, percent-clipped=7.0 2023-06-29 01:02:18,258 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=2187066.0, ans=0.0 2023-06-29 01:02:26,784 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=2187126.0, ans=0.125 2023-06-29 01:02:28,737 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=5.08 vs. limit=15.0 2023-06-29 01:02:35,423 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=6.26 vs. limit=10.0 2023-06-29 01:02:41,498 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=2187186.0, ans=0.0 2023-06-29 01:02:56,060 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=2187186.0, ans=0.0 2023-06-29 01:03:00,146 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=6.71 vs. limit=10.0 2023-06-29 01:03:02,208 INFO [train.py:996] (3/4) Epoch 12, batch 29100, loss[loss=0.1683, simple_loss=0.2355, pruned_loss=0.05057, over 21516.00 frames. ], tot_loss[loss=0.2159, simple_loss=0.2912, pruned_loss=0.0703, over 4278810.22 frames. 
], batch size: 230, lr: 2.37e-03, grad_scale: 16.0 2023-06-29 01:03:28,392 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=2187306.0, ans=0.0 2023-06-29 01:03:51,510 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=2187366.0, ans=0.125 2023-06-29 01:03:54,775 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=2187366.0, ans=0.125 2023-06-29 01:03:56,173 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=2187366.0, ans=0.2 2023-06-29 01:04:01,346 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=2187426.0, ans=0.0 2023-06-29 01:04:09,624 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=2187426.0, ans=0.125 2023-06-29 01:04:10,240 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.65 vs. limit=6.0 2023-06-29 01:04:22,732 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=2187486.0, ans=0.0 2023-06-29 01:04:35,926 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=2187486.0, ans=0.125 2023-06-29 01:04:38,536 INFO [train.py:996] (3/4) Epoch 12, batch 29150, loss[loss=0.2187, simple_loss=0.315, pruned_loss=0.06114, over 21820.00 frames. ], tot_loss[loss=0.2139, simple_loss=0.2901, pruned_loss=0.06884, over 4283471.69 frames. ], batch size: 316, lr: 2.37e-03, grad_scale: 16.0 2023-06-29 01:04:59,725 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.37 vs. limit=22.5 2023-06-29 01:05:30,924 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.044e+02 8.799e+02 1.298e+03 1.831e+03 4.569e+03, threshold=2.596e+03, percent-clipped=20.0 2023-06-29 01:05:37,904 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=2187666.0, ans=0.0 2023-06-29 01:05:40,509 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.02 vs. limit=15.0 2023-06-29 01:05:57,362 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=2187786.0, ans=0.125 2023-06-29 01:06:14,115 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=2187786.0, ans=0.0 2023-06-29 01:06:18,429 INFO [train.py:996] (3/4) Epoch 12, batch 29200, loss[loss=0.1958, simple_loss=0.2643, pruned_loss=0.06359, over 15149.00 frames. ], tot_loss[loss=0.2102, simple_loss=0.2856, pruned_loss=0.06743, over 4281903.70 frames. ], batch size: 60, lr: 2.37e-03, grad_scale: 32.0 2023-06-29 01:06:18,944 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=2187846.0, ans=0.0 2023-06-29 01:07:39,569 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=10.49 vs. 
limit=15.0 2023-06-29 01:07:53,408 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2188086.0, ans=0.1 2023-06-29 01:07:54,012 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.66 vs. limit=12.0 2023-06-29 01:08:03,704 INFO [train.py:996] (3/4) Epoch 12, batch 29250, loss[loss=0.1948, simple_loss=0.251, pruned_loss=0.06928, over 20265.00 frames. ], tot_loss[loss=0.208, simple_loss=0.2848, pruned_loss=0.0656, over 4281636.64 frames. ], batch size: 703, lr: 2.37e-03, grad_scale: 16.0 2023-06-29 01:08:17,076 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=2188146.0, ans=0.0 2023-06-29 01:08:30,253 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=2188206.0, ans=0.025 2023-06-29 01:08:43,101 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=2188206.0, ans=0.125 2023-06-29 01:08:53,906 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.020e+02 6.977e+02 9.878e+02 1.357e+03 4.006e+03, threshold=1.976e+03, percent-clipped=3.0 2023-06-29 01:08:57,664 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=2188266.0, ans=0.5 2023-06-29 01:09:04,390 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=2188326.0, ans=0.2 2023-06-29 01:09:28,709 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2188386.0, ans=0.1 2023-06-29 01:09:43,866 INFO [train.py:996] (3/4) Epoch 12, batch 29300, loss[loss=0.1826, simple_loss=0.2591, pruned_loss=0.05305, over 20712.00 frames. ], tot_loss[loss=0.208, simple_loss=0.2866, pruned_loss=0.06472, over 4287336.66 frames. ], batch size: 607, lr: 2.37e-03, grad_scale: 16.0 2023-06-29 01:09:49,595 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=10.84 vs. limit=15.0 2023-06-29 01:09:55,726 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.46 vs. limit=22.5 2023-06-29 01:10:28,038 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=2188566.0, ans=0.025 2023-06-29 01:11:05,468 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.90 vs. limit=10.0 2023-06-29 01:11:30,196 INFO [train.py:996] (3/4) Epoch 12, batch 29350, loss[loss=0.2125, simple_loss=0.3077, pruned_loss=0.05865, over 21651.00 frames. ], tot_loss[loss=0.2054, simple_loss=0.2826, pruned_loss=0.06405, over 4279305.86 frames. 
], batch size: 414, lr: 2.37e-03, grad_scale: 16.0 2023-06-29 01:12:16,937 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.695e+02 7.385e+02 1.115e+03 1.625e+03 3.431e+03, threshold=2.230e+03, percent-clipped=15.0 2023-06-29 01:12:29,364 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=2188926.0, ans=0.0 2023-06-29 01:13:07,760 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.61 vs. limit=15.0 2023-06-29 01:13:11,565 INFO [train.py:996] (3/4) Epoch 12, batch 29400, loss[loss=0.1736, simple_loss=0.2545, pruned_loss=0.04639, over 21737.00 frames. ], tot_loss[loss=0.2043, simple_loss=0.2832, pruned_loss=0.06268, over 4273215.25 frames. ], batch size: 332, lr: 2.37e-03, grad_scale: 16.0 2023-06-29 01:14:52,513 INFO [train.py:996] (3/4) Epoch 12, batch 29450, loss[loss=0.207, simple_loss=0.2878, pruned_loss=0.06311, over 21629.00 frames. ], tot_loss[loss=0.2023, simple_loss=0.2805, pruned_loss=0.06201, over 4265841.08 frames. ], batch size: 263, lr: 2.37e-03, grad_scale: 16.0 2023-06-29 01:15:44,177 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.913e+02 9.244e+02 1.482e+03 2.285e+03 4.603e+03, threshold=2.964e+03, percent-clipped=27.0 2023-06-29 01:16:00,921 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2189526.0, ans=0.1 2023-06-29 01:16:36,083 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=2189586.0, ans=0.0 2023-06-29 01:16:38,750 INFO [train.py:996] (3/4) Epoch 12, batch 29500, loss[loss=0.2147, simple_loss=0.2868, pruned_loss=0.07131, over 21294.00 frames. ], tot_loss[loss=0.2073, simple_loss=0.2848, pruned_loss=0.06493, over 4272284.44 frames. ], batch size: 176, lr: 2.37e-03, grad_scale: 16.0 2023-06-29 01:17:04,723 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=2189706.0, ans=0.125 2023-06-29 01:17:27,470 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=2189766.0, ans=0.125 2023-06-29 01:17:38,493 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=2189826.0, ans=0.2 2023-06-29 01:17:55,978 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=2189886.0, ans=0.0 2023-06-29 01:17:59,025 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2189886.0, ans=0.1 2023-06-29 01:18:18,332 INFO [train.py:996] (3/4) Epoch 12, batch 29550, loss[loss=0.2069, simple_loss=0.3132, pruned_loss=0.05025, over 19853.00 frames. ], tot_loss[loss=0.2093, simple_loss=0.2857, pruned_loss=0.06647, over 4279093.70 frames. 
], batch size: 703, lr: 2.37e-03, grad_scale: 16.0 2023-06-29 01:18:32,413 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=2189946.0, ans=0.125 2023-06-29 01:18:34,328 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2190006.0, ans=0.1 2023-06-29 01:18:40,034 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.23 vs. limit=22.5 2023-06-29 01:18:52,863 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=2190066.0, ans=0.2 2023-06-29 01:19:05,171 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.716e+02 8.394e+02 1.189e+03 1.876e+03 3.636e+03, threshold=2.379e+03, percent-clipped=6.0 2023-06-29 01:19:40,310 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=2190186.0, ans=0.125 2023-06-29 01:20:00,973 INFO [train.py:996] (3/4) Epoch 12, batch 29600, loss[loss=0.3119, simple_loss=0.4005, pruned_loss=0.1116, over 21510.00 frames. ], tot_loss[loss=0.2159, simple_loss=0.2931, pruned_loss=0.06937, over 4286980.33 frames. ], batch size: 471, lr: 2.37e-03, grad_scale: 32.0 2023-06-29 01:20:30,568 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=2190306.0, ans=0.2 2023-06-29 01:21:41,168 INFO [train.py:996] (3/4) Epoch 12, batch 29650, loss[loss=0.1675, simple_loss=0.2415, pruned_loss=0.0467, over 21644.00 frames. ], tot_loss[loss=0.2102, simple_loss=0.2895, pruned_loss=0.06546, over 4284197.08 frames. ], batch size: 230, lr: 2.37e-03, grad_scale: 16.0 2023-06-29 01:21:53,248 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=2190546.0, ans=0.125 2023-06-29 01:22:33,824 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 4.536e+02 9.770e+02 1.872e+03 2.859e+03 6.209e+03, threshold=3.743e+03, percent-clipped=35.0 2023-06-29 01:23:05,243 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=2190786.0, ans=0.125 2023-06-29 01:23:22,768 INFO [train.py:996] (3/4) Epoch 12, batch 29700, loss[loss=0.2162, simple_loss=0.2945, pruned_loss=0.06896, over 15856.00 frames. ], tot_loss[loss=0.2116, simple_loss=0.2912, pruned_loss=0.06599, over 4285910.08 frames. ], batch size: 60, lr: 2.37e-03, grad_scale: 16.0 2023-06-29 01:23:44,551 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten.whitening_limit, batch_count=2190906.0, ans=22.5 2023-06-29 01:23:46,252 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.89 vs. limit=10.0 2023-06-29 01:23:46,460 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=5.26 vs. 
limit=15.0 2023-06-29 01:23:50,500 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=2190906.0, ans=0.0 2023-06-29 01:23:52,349 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=2190906.0, ans=0.125 2023-06-29 01:24:06,523 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=2190966.0, ans=0.2 2023-06-29 01:24:54,992 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=2191086.0, ans=0.125 2023-06-29 01:25:02,357 INFO [train.py:996] (3/4) Epoch 12, batch 29750, loss[loss=0.2131, simple_loss=0.3074, pruned_loss=0.05937, over 21754.00 frames. ], tot_loss[loss=0.2128, simple_loss=0.295, pruned_loss=0.06531, over 4273321.69 frames. ], batch size: 247, lr: 2.37e-03, grad_scale: 16.0 2023-06-29 01:25:58,258 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.524e+02 7.715e+02 1.077e+03 1.535e+03 3.860e+03, threshold=2.154e+03, percent-clipped=1.0 2023-06-29 01:26:42,163 INFO [train.py:996] (3/4) Epoch 12, batch 29800, loss[loss=0.2059, simple_loss=0.2835, pruned_loss=0.06417, over 21501.00 frames. ], tot_loss[loss=0.2147, simple_loss=0.2964, pruned_loss=0.06648, over 4278039.58 frames. ], batch size: 212, lr: 2.37e-03, grad_scale: 16.0 2023-06-29 01:27:22,842 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2191566.0, ans=0.1 2023-06-29 01:27:48,140 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=2191626.0, ans=0.05 2023-06-29 01:28:15,101 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=2191686.0, ans=0.125 2023-06-29 01:28:20,978 INFO [train.py:996] (3/4) Epoch 12, batch 29850, loss[loss=0.2069, simple_loss=0.2769, pruned_loss=0.06846, over 21792.00 frames. ], tot_loss[loss=0.2097, simple_loss=0.2909, pruned_loss=0.06427, over 4272425.62 frames. ], batch size: 112, lr: 2.37e-03, grad_scale: 16.0 2023-06-29 01:28:40,673 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2191806.0, ans=0.1 2023-06-29 01:29:16,869 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.021e+02 7.804e+02 1.039e+03 1.669e+03 3.761e+03, threshold=2.078e+03, percent-clipped=15.0 2023-06-29 01:30:00,609 INFO [train.py:996] (3/4) Epoch 12, batch 29900, loss[loss=0.2278, simple_loss=0.2927, pruned_loss=0.0815, over 21496.00 frames. ], tot_loss[loss=0.2113, simple_loss=0.2911, pruned_loss=0.06571, over 4274322.29 frames. 
], batch size: 211, lr: 2.37e-03, grad_scale: 16.0 2023-06-29 01:30:30,082 INFO [scaling.py:1052] (3/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-29 01:31:04,513 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=2192226.0, ans=0.125 2023-06-29 01:31:22,329 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=2192226.0, ans=0.125 2023-06-29 01:31:32,407 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=2192286.0, ans=0.125 2023-06-29 01:31:46,206 INFO [train.py:996] (3/4) Epoch 12, batch 29950, loss[loss=0.2291, simple_loss=0.3052, pruned_loss=0.07652, over 21675.00 frames. ], tot_loss[loss=0.2168, simple_loss=0.2946, pruned_loss=0.06948, over 4275487.65 frames. ], batch size: 351, lr: 2.37e-03, grad_scale: 16.0 2023-06-29 01:31:54,085 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.25 vs. limit=6.0 2023-06-29 01:31:55,247 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=2192346.0, ans=0.0 2023-06-29 01:32:38,613 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.758e+02 9.924e+02 1.385e+03 1.832e+03 3.568e+03, threshold=2.770e+03, percent-clipped=22.0 2023-06-29 01:33:32,981 INFO [train.py:996] (3/4) Epoch 12, batch 30000, loss[loss=0.1956, simple_loss=0.3067, pruned_loss=0.04227, over 20766.00 frames. ], tot_loss[loss=0.2181, simple_loss=0.297, pruned_loss=0.06958, over 4275218.25 frames. ], batch size: 608, lr: 2.37e-03, grad_scale: 32.0 2023-06-29 01:33:32,982 INFO [train.py:1019] (3/4) Computing validation loss 2023-06-29 01:33:51,777 INFO [train.py:1028] (3/4) Epoch 12, validation: loss=0.255, simple_loss=0.3458, pruned_loss=0.08216, over 1796401.00 frames. 2023-06-29 01:33:51,778 INFO [train.py:1029] (3/4) Maximum memory allocated so far is 23690MB 2023-06-29 01:33:56,084 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=2192646.0, ans=0.125 2023-06-29 01:33:58,256 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=2192646.0, ans=0.125 2023-06-29 01:34:50,240 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=2192766.0, ans=0.05 2023-06-29 01:34:53,433 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=2192826.0, ans=0.0 2023-06-29 01:34:53,434 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=2192826.0, ans=0.125 2023-06-29 01:35:38,187 INFO [train.py:996] (3/4) Epoch 12, batch 30050, loss[loss=0.2266, simple_loss=0.363, pruned_loss=0.04515, over 20713.00 frames. ], tot_loss[loss=0.2178, simple_loss=0.3017, pruned_loss=0.06691, over 4272873.80 frames. 
], batch size: 607, lr: 2.37e-03, grad_scale: 16.0 2023-06-29 01:35:43,889 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2192946.0, ans=0.1 2023-06-29 01:36:03,387 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=2193006.0, ans=0.125 2023-06-29 01:36:36,202 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.629e+02 9.060e+02 1.265e+03 2.367e+03 5.681e+03, threshold=2.530e+03, percent-clipped=16.0 2023-06-29 01:36:37,254 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.42 vs. limit=22.5 2023-06-29 01:36:39,719 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=2193126.0, ans=0.0 2023-06-29 01:36:55,614 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=2193126.0, ans=0.0 2023-06-29 01:37:13,922 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.65 vs. limit=15.0 2023-06-29 01:37:17,749 INFO [train.py:996] (3/4) Epoch 12, batch 30100, loss[loss=0.199, simple_loss=0.2608, pruned_loss=0.06861, over 21473.00 frames. ], tot_loss[loss=0.2164, simple_loss=0.2996, pruned_loss=0.06657, over 4268554.35 frames. ], batch size: 195, lr: 2.37e-03, grad_scale: 16.0 2023-06-29 01:38:05,355 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=5.38 vs. limit=22.5 2023-06-29 01:38:07,270 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.23 vs. limit=22.5 2023-06-29 01:38:41,518 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=2193486.0, ans=0.0 2023-06-29 01:39:01,687 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=2193486.0, ans=0.125 2023-06-29 01:39:04,267 INFO [train.py:996] (3/4) Epoch 12, batch 30150, loss[loss=0.2268, simple_loss=0.3052, pruned_loss=0.07417, over 21131.00 frames. ], tot_loss[loss=0.215, simple_loss=0.295, pruned_loss=0.06746, over 4266404.61 frames. ], batch size: 143, lr: 2.37e-03, grad_scale: 16.0 2023-06-29 01:39:11,440 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=2193546.0, ans=0.2 2023-06-29 01:39:59,104 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=2193666.0, ans=0.0 2023-06-29 01:40:05,026 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.363e+02 8.482e+02 1.272e+03 2.081e+03 3.656e+03, threshold=2.544e+03, percent-clipped=13.0 2023-06-29 01:40:33,914 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.74 vs. limit=15.0 2023-06-29 01:40:47,522 INFO [train.py:996] (3/4) Epoch 12, batch 30200, loss[loss=0.2119, simple_loss=0.3351, pruned_loss=0.04434, over 21190.00 frames. ], tot_loss[loss=0.215, simple_loss=0.297, pruned_loss=0.06648, over 4260896.88 frames. 
], batch size: 549, lr: 2.37e-03, grad_scale: 16.0 2023-06-29 01:41:32,158 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=2193906.0, ans=0.2 2023-06-29 01:41:52,716 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=2194026.0, ans=0.0 2023-06-29 01:42:38,873 INFO [train.py:996] (3/4) Epoch 12, batch 30250, loss[loss=0.3478, simple_loss=0.4292, pruned_loss=0.1332, over 21472.00 frames. ], tot_loss[loss=0.2212, simple_loss=0.3044, pruned_loss=0.06902, over 4262688.05 frames. ], batch size: 507, lr: 2.37e-03, grad_scale: 16.0 2023-06-29 01:43:33,018 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.232e+02 7.981e+02 1.163e+03 1.576e+03 2.909e+03, threshold=2.325e+03, percent-clipped=5.0 2023-06-29 01:43:35,003 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2194266.0, ans=0.1 2023-06-29 01:43:40,119 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=2194326.0, ans=0.125 2023-06-29 01:44:03,427 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.23 vs. limit=12.0 2023-06-29 01:44:21,065 INFO [train.py:996] (3/4) Epoch 12, batch 30300, loss[loss=0.2029, simple_loss=0.2701, pruned_loss=0.06785, over 21863.00 frames. ], tot_loss[loss=0.2199, simple_loss=0.3016, pruned_loss=0.06911, over 4257762.17 frames. ], batch size: 107, lr: 2.37e-03, grad_scale: 16.0 2023-06-29 01:44:45,098 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=2194506.0, ans=0.0 2023-06-29 01:44:50,209 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=2194506.0, ans=0.125 2023-06-29 01:45:21,364 INFO [scaling.py:962] (3/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=3.86 vs. limit=5.0 2023-06-29 01:45:22,509 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.68 vs. limit=15.0 2023-06-29 01:45:47,634 INFO [scaling.py:962] (3/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.19 vs. limit=6.0 2023-06-29 01:46:05,222 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=2194686.0, ans=0.125 2023-06-29 01:46:09,182 INFO [train.py:996] (3/4) Epoch 12, batch 30350, loss[loss=0.2768, simple_loss=0.3803, pruned_loss=0.08666, over 21685.00 frames. ], tot_loss[loss=0.221, simple_loss=0.3018, pruned_loss=0.07006, over 4257079.54 frames. 
], batch size: 389, lr: 2.37e-03, grad_scale: 16.0 2023-06-29 01:46:09,799 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=2194746.0, ans=0.0 2023-06-29 01:46:33,729 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=2194806.0, ans=0.04949747468305833 2023-06-29 01:46:49,128 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.687e+02 9.388e+02 1.588e+03 2.178e+03 4.101e+03, threshold=3.176e+03, percent-clipped=21.0 2023-06-29 01:47:26,694 INFO [train.py:996] (3/4) Epoch 12, batch 30400, loss[loss=0.1999, simple_loss=0.2576, pruned_loss=0.07113, over 20232.00 frames. ], tot_loss[loss=0.2186, simple_loss=0.2979, pruned_loss=0.06963, over 4249863.39 frames. ], batch size: 703, lr: 2.37e-03, grad_scale: 32.0 2023-06-29 01:48:24,664 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=2195226.0, ans=0.0 2023-06-29 01:48:50,489 INFO [train.py:996] (3/4) Epoch 12, batch 30450, loss[loss=0.263, simple_loss=0.379, pruned_loss=0.07347, over 19848.00 frames. ], tot_loss[loss=0.2184, simple_loss=0.2985, pruned_loss=0.06917, over 4192921.89 frames. ], batch size: 702, lr: 2.37e-03, grad_scale: 8.0 2023-06-29 01:49:38,085 INFO [optim.py:471] (3/4) Clipping_scale=2.0, grad-norm quartiles 5.733e+02 1.475e+03 2.498e+03 5.657e+03 1.532e+04, threshold=4.997e+03, percent-clipped=41.0 2023-06-29 01:49:44,861 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=2195526.0, ans=0.0 2023-06-29 01:49:53,520 INFO [scaling.py:182] (3/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=2195586.0, ans=0.0 2023-06-29 01:49:57,378 INFO [train.py:1249] (3/4) Done!
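The log above closes with the final batches of epoch 12 and the "Done!" marker. For readers who want to track how tot_loss and the periodic validation loss evolve over a run like this, the following is a minimal sketch (not part of the icefall codebase) that scrapes those values out of a plain-text log with regular expressions. The file name train_log.txt is an assumption; the patterns only rely on the "Epoch N, batch M, ... tot_loss[loss=..." and "Epoch N, validation: loss=..." fields visible in the log itself.

import re

# Patterns matching the log lines shown above, e.g.
#   "... Epoch 12, batch 28000, loss[...], tot_loss[loss=0.2078, simple_loss=..., ...]"
#   "... Epoch 12, validation: loss=0.255, simple_loss=..., ..."
# DOTALL lets a match survive the hard line-wrapping seen in dumps like this one.
TOT_LOSS_RE = re.compile(r"Epoch (\d+), batch (\d+),.*?tot_loss\[loss=([\d.]+)", re.DOTALL)
VALID_RE = re.compile(r"Epoch (\d+), validation: loss=([\d.]+)")

def parse_log(path="train_log.txt"):
    """Return (train_points, valid_points) extracted from an icefall-style log dump."""
    with open(path, encoding="utf-8", errors="replace") as f:
        text = f.read()
    # (epoch, batch, tot_loss) for every per-batch summary line
    train_points = [(int(e), int(b), float(l)) for e, b, l in TOT_LOSS_RE.findall(text)]
    # (epoch, validation loss) for every "Computing validation loss" result
    valid_points = [(int(e), float(l)) for e, l in VALID_RE.findall(text)]
    return train_points, valid_points

if __name__ == "__main__":
    train_points, valid_points = parse_log()
    if train_points:
        epoch, batch, loss = train_points[-1]
        print(f"last tot_loss: {loss:.4f} (epoch {epoch}, batch {batch})")
    for epoch, loss in valid_points:
        print(f"validation loss at epoch {epoch}: {loss:.4f}")

Run against the text of this log, the last line printed for training would correspond to the batch 30450 summary above, and the validation entries would include the epoch 12 value of 0.255 reported near batch 30000.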